Hindi 101

I’m taking Hindi 101 this semester. The Devanagari script feels mildly ornate in my hand compared to the angularity of alphabets descended from the Phoenician script (including the English alphabet), but it is quite lovely and not as challenging as I had imagined. It is still an alphabet, after all, with a much closer sound-grapheme correspondence than one finds in English, where each letter, particularly the vowels, can correspond to multiple phonemes. (English grammar is absurdly simple compared to all other major languages, but our spelling system must be a nightmare for foreign learners. There’s something to be said for language academies that control the drift between pronunciation and spelling.) Devanagari does, however, omit some vowel sounds and uses secondary or “dependent” vowel forms in most contexts, so it has something of the syllabary about it. In fact, the biggest mistake I make in class is to confuse two dependent vowels, ी and ो. The former is long “ee”, the latter is “o”, but in certain fonts (including my own handwriting), they look nearly identical.

The script’s biconsonantal conjuncts are mostly intuitive, though a few bizarre ones need to be memorized as separate graphemes. We have conjuncts in English, but I believe they are a relatively new innovation with limited usage. One example is the city logo of Huntington Beach, California. Hindi has a lot of these, and they are quite common.

An English biconsonantal conjunct.

Apart from learning a new script, the most enjoyable part of Hindi class has been coming across Romance or Germanic cognates. At an intellectual level, I know and have long known that Hindi and English, both Indo-European languages, share a genetic ancestry, which means that at some point in the distant past all Indo-European speakers spoke the same language. It’s easy to get a handle on the concept when talking about Romance languages: Spanish, Italian, and French all used to be Latin. There, we have a well-documented history, stretching back through the Renaissance and Middle Ages to the familiar world of Rome. However, when it comes to Proto-Indo-European, we are faced with a deeper and wider canyon of time and an ancient world that is mostly unknown to us. The PIE speakers were probably living in the Pontic-Caspian steppe lands, but some evidence suggests that they may have been living in the greater Anatolian region; perhaps the most direct descendants of Proto-Indo-Europeans are today’s Armenians, Turks, and Persians. They apparently kicked ass and took names because Indo-European now stretches from the Pacific to the Indian Oceans.

But whoever they were, the PIE speakers are remote in a way that the Romans or Germanic tribes are not. Yet while doing my Hindi homework, every now and again I come across a word that clearly indicates the ancient linguistic (and genetic) connectedness between the Romans, the Germans, and the Hindi speakers. Kamiz for shirt; mez for table; kamra for room; mata for mother; pita for father; nam for name; darvaza for door . . . In Hindi class, when I say a word out loud that is clearly related to a European word, I am intoning sounds close to the ones that came from the lips of those ancient Indo-Europeans before they split eastward and westward to conquer Eurasia. To language nerds like me, it’s a chilling sensation.

Elliot Rodger’s Manifesto: Text Networks and Corpus Features

Analyzing manifestos is becoming a theme at this blog. Click here for Chris Dorner’s manifesto and here for the Unabomber manifesto.

Manifestos are interesting because they are the most deliberately written and deliberately personal of genres. It’s tenuous to make claims about a person’s psyche based on the linguistic features of his personal emails; it’s far less tenuous to make claims about a person’s psyche based on the linguistic features of his manifesto—especially one written right before he goes on a kill rampage. This one—“My Twisted World,” written by omega male Elliot Rodger—is 140 pages long, and is part manifesto, part autobiography.

I’ve made a lot of text networks over the years—of manifestos, of novels, of poems. Never before have I seen such a long text exhibit this kind of stark, binary division:

RodgersBetweennessCentrality

This network visualizes the nodes with the highest betweenness centrality. The lower, light blue cluster is Elliot’s domestic language; this is where you’ll find words like “friends”, “school,” “house,” et cetera . . . words describing his life in general. The higher, red cluster is Elliot’s sexually frustrated language; this is where you’ll find words like “girls,” “women,” “sex,” “experience,” “beautiful,” “never”  . . . words describing his relationships with (or lack thereof) the feminine half of our species.
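For anyone who wants to poke at the idea of betweenness centrality in a text network, here is a minimal sketch in plain Python. It is not the AutoMap workflow used for the image above. The function names, the co-occurrence window, and the toy token list are all assumptions made for illustration. The first function links words that occur near each other, and the second scores each word by how many shortest paths between other words pass through it (Brandes' algorithm, unnormalized).

```python
from collections import defaultdict, deque


def cooccurrence_graph(tokens, window=2):
    """Link each token to the tokens that follow it within `window` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def betweenness(graph):
    """Unnormalized betweenness for an undirected, unweighted graph."""
    scores = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search, counting shortest paths from `source`.
        order = []
        preds = {node: [] for node in graph}
        paths = {node: 0 for node in graph}
        dist = {node: -1 for node in graph}
        paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                scores[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {node: score / 2 for node, score in scores.items()}


# Toy input (hypothetical, not drawn from the manifesto).
tokens = "my friends my school my house girls never girls sex".split()
for word, score in sorted(betweenness(cooccurrence_graph(tokens)).items()):
    print(word, score)
```

Words that bridge otherwise separate clusters get the highest scores, which is why a text whose two vocabularies rarely co-occur shows up as two blobs joined by a few high-betweenness nodes.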

It’s quite startling. Although this text is part manifesto and part autobiography, I wasn’t expecting such a clear division: the language Elliot uses to describe his sexually frustrated life is almost wholly severed from the language he uses to describe his life apart from the sex and the “girls” (Elliot uses “girls” far more frequently than he uses “women”—see below). It’s as though Elliot had completely compartmentalized his sexual frustration, and was keeping it at bay. Or trying to. I don’t know how this plays out in individual sections of the manifesto. Nor do I know what it says about Elliot’s mental health more generally. I’ve always believed that compartmentalizing frustrations is, contra popular advice, a rather healthy thing to do. I expected a very, very tortuous and conflicted network to emerge here, indicating that each aspect of Elliot’s life was dripping with sexual angst and misogyny. Not so, it turns out.

Here’s a brief “zoom” on each section:

RodgersDegreeCentralityDomestic

RodgersDegreeCentralityWomen

In the large, zoomed-out network (the first one in the post), notice that the most central nodes are “me” and “my.” I processed the text using AutoMap but decided to retain the pronouns, curious how the feminine, masculine, and personal pronouns would play out in the networks and the dispersion plots. Feminine, masculine, personal: not just pronouns in this particular text. And what emerges when the pronouns are retained is an obvious image of the Personal. Rodger’s manifesto is brimming with self-reference:

RodgersPronouns

Take that with a grain of salt, of course. In making claims about any text with these methods, one should compare features with the features of general text corpora and with texts of a similar type. The Brown Corpus provides some perspective: “It” is the most frequent pronoun in that corpus; “I” is second; “me” is far down the list, past the third-person pronouns.
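The comparison described above is easier to make with relative rather than raw frequencies. Here is a rough sketch of that step. The pronoun list, the tokenizer, and the function name are my own assumptions, not anything taken from AutoMap or the Brown Corpus tooling, and the sample sentence is made up.

```python
import re
from collections import Counter

# A small, hypothetical pronoun list; extend it as needed.
PRONOUNS = {"i", "me", "my", "he", "him", "his", "she", "her", "it", "they", "them"}


def pronoun_rates(text, per=1000):
    """Return each pronoun's frequency per `per` tokens of running text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {}
    counts = Counter(t for t in tokens if t in PRONOUNS)
    return {p: counts[p] * per / len(tokens) for p in sorted(counts)}


sample = "I gave my book to her. She thanked me."
print(pronoun_rates(sample))
```

Running the same function over a reference corpus and over the text in question gives two sets of rates that can be compared directly, whatever the difference in length.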

Here’s another narcissistic twist, found in the most frequent words in the text. Again, pronouns have been retained.

RodgersFreqWords

“I” is the most frequent word in the entire text, coming before even the basic functional workhorses of the English language. The Brown Corpus once more provides perspective: “I” is the 11th most frequent word in that general corpus. Of course, as noted, there is an auto-biographic ethos to this manifesto, so it would be worth checking whether or not other auto-biographies bump “I” to the number one spot. Perhaps. But I would be surprised if “I,” “me,” and “my” all clustered in the top 10 in a typical auto-biography—a narcissistic genre by design, yet I imagine that self-aware authors attempt to balance the “I” with a pro-social dose of “thou.” Maybe I’m wrong. It would be worth checking.

More lexical dispersion plots . . .

Much more negation is seen below than is typically found in texts. According to Michael Halliday, most text corpora will exhibit 10% negative polarity and 90% positive polarity. Elliot’s manifesto, however, bursts with negation. Also notice, below, the constant references to “mother” and “father”: his parents are central characters. But not “mom” and “dad.” I’m from Southern California, born and raised, with social experience across the races and classes, but I’ve never heard a single English-only speaker refer to parents as “mother” and “father” instead of “mom” and “dad.” Was Elliot bilingual? Finally, note that Elliot prefers “girl/s” to “woman/en.”
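A lexical dispersion plot is just a record of where in the text each target word turns up. A minimal sketch of the underlying calculation follows. The function name, the tokenizer, and the sample sentence are assumptions for illustration, and the output is a list of relative positions rather than a drawn plot.

```python
import re


def dispersion(text, targets):
    """Map each target word to its relative positions (0.0 = start of text)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    wanted = {t.lower() for t in targets}
    positions = {t: [] for t in wanted}
    for i, tok in enumerate(tokens):
        if tok in wanted:
            positions[tok].append(i / len(tokens))
    return positions


sample = "Mother said no. Father said never."
print(dispersion(sample, ["mother", "father", "never"]))
```

Plotting each list as tick marks along a horizontal line, one row per word, gives the familiar dispersion picture.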

RodgersGirlsGuys

RodgersMotherFather

RodgersNegation

RodgersSexEtc

Until I discover that auto-biographical texts always drip with personal pronouns, I would argue that Elliot’s manifesto is the product of an especially narcissistic personality. The boy couldn’t go two sentences without referencing himself in some way.

And what about the misogyny? He uses masculine pronouns as often as he uses feminine pronouns; he refers to his father as often as he refers to his mother—although, it is true, the references to mother become more frequent, relative to father, as Elliot pushes toward his misogynistic climax. Overall, however, the rhetorical energy in the text is not expended on females in particular. This is not an anti-woman screed from beginning to end. Also, recall, the preferred term is “girls,” not “women.” Elliot hated girls. Women—middle-aged, old, married, ensconced in careers, not apt to wear bikinis on the Santa Barbara beach—are hardly on Elliot’s radar. (This ageism also comes through in his YouTube videos.) Despite the “I hate all women” rhetorical flourishes at the very beginning and the very end of his manifesto, Elliot prefers to write about girls—young, blonde, unmarried, pre-career, in sororities, apt to wear bikinis on the Santa Barbara beach.

I noticed something similar in the Unabomber manifesto. Not about the girls. About the beginning and ending: what we remember most from that manifesto is its anti-PC bookends, even though the bulk of the manifesto devotes itself to very different subject matter. The quotes pulled from manifestos (including this one) and published by news outlets are a few subjective anecdotes, not the totality of the text.

Anyway. Pieces of writing that sally forth from such diseased individuals always call to mind what Kenneth Burke said about Mein Kampf:

[Hitler] was helpful enough to put his cards face up on the table, that we might examine his hands. Let us, then, for God’s sake, examine them.

Demographic distribution: Gender of citations in CCC, RSQ, and RR abstracts

This post follows up on my discussion of citation frequencies in abstracts in rhetoric and composition journals. To reiterate, a safe assumption to make is that citations in abstracts are “central” to the arguments presented and the research undertaken in the articles themselves; they are particularly informative about overall trends. The genre of the humanities article demands more citations than a core argument actually requires, so looking at citations in abstracts should control for that genre requirement, distilling down all citations to the most vital ones.

The journals: College Composition and Communication (CCC), Rhetoric Society Quarterly (RSQ), and Rhetoric Review (RR). The CCC abstracts run from February 2000 (51.3) to September 2011 (63.1), a total of 261 abstracts. The RSQ abstracts run from Winter 2000 (30.1) to Fall 2011 (41.5), a total of 220 abstracts. The RR abstracts run from 2002 (21.3) to 2011 (30.4), a total of 154 abstracts.

The previous post discussed the “long tail” distribution that emerged from the citation frequencies and what it means for disciplinary identity. This post presents information on the gender of the sources cited in the abstracts, then makes a few comments about demographic distributions in general.

There are 79 unique citations in the CCC abstracts; 159 unique citations in the RSQ abstracts; and 121 unique citations in the RR abstracts. (See previous post for .xls data files.) Here’s how the gender distribution falls: in CCC, 23 out of the 79 sources are female; in RSQ, 39 out of the 159 sources are female; in RR, 36 out of the 121 sources are female.
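The percentages follow directly from those counts. A quick sketch of the arithmetic, using only the numbers stated above (the variable and function names are mine):

```python
# (female sources, total unique sources) per journal, as reported above.
counts = {"CCC": (23, 79), "RSQ": (39, 159), "RR": (36, 121)}


def female_share(female, total):
    """Percentage of sources that are female, to one decimal place."""
    return round(100 * female / total, 1)


shares = {journal: female_share(f, t) for journal, (f, t) in counts.items()}
print(shares)  # {'CCC': 29.1, 'RSQ': 24.5, 'RR': 29.8}
```

So roughly 29% of abstract citations in CCC, 25% in RSQ, and 30% in RR are to female sources.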

And here are graphs of the raw counts and of the percentages:

Abstract citations by gender (raw count)

Abstract citations by gender (percentage)

In Authoring a Discipline, Maureen Daly Goggin has shown that by 1990 total contributors to 9 of rhetoric and composition’s major journals—including the 3 analyzed here—had equalized to a nearly 50/50 split between males and females. I imagine this trend has continued into the new millennium, but it would be worthwhile to determine whether or not that’s the case.

What has not equalized, however, is the gender contribution in terms of citations. Odds are, counting all citations in the articles themselves would alleviate the large gap seen in the graphs above. But insofar as we accept that abstract citations represent the most vital sources in each journal, then an obvious gender gap still exists in CCC, RSQ, and RR citations.

In RSQ and RR, this gap, in part, likely has something to do with these journals’ tendencies to publish work on rhetorical history. I pointed this out in the last post: 27 (or 22%) of the RR citations are sources from the 17th century or earlier. 26 (or 16%) of RSQ citations are from the same period. Those numbers would grow if they included figures from the 18th and 19th centuries, as well. The reality is, most of these historical sources are male: Plato, Cicero, Aristotle, Quintilian, et cetera.

I have no ready explanation for why CCC citations should have as large a gender gap as the other journals’ citations, given that CCC builds most of its scholarship on sources from the middle part of the 20th century or later. If we look at the 102 most cited figures in CCC between 1987 and 2011 (Mueller, “Grasping”), we discover that 43/102 (42%) of the sources are female: a gender imbalance, but one not nearly as pronounced as the imbalance that surfaces in abstract citations. I’d be curious to see the gender distribution in Mueller’s entire data set. Is there a nearly 50/50 split between male and female sources across all citations in CCC between 1987 and 2011? If so, we could model the gender imbalance in this journal’s citations as an emergent feature: 50/50 across the entire data set; 58/42 in the most popular citations between 1987 and 2011; 71/29 in abstracts between 2000 and 2011. It’s unfortunate that CCC did not publish abstracts until the late 1990s, so that the dates of the abstracts and the articles could be uniform.

The question of demographic balance is one that spills a lot of digital ink. Just this morning, Scott Weingart visualized the gender (im)balance of Digital Humanities Conference attendees: about a 70/30 split that favors males. And Google recently released the demographic characteristics of its workforce: 30% of its employees are women; 17% of its technical employees are women. 60% of its employees are white; 30% of its employees are Asian (read: East Asian and Indian); and only 3% of its employees are Non-Asian Minorities.

I asked Scott why our default assumption should be uniform demographic distribution. When looking at statistical trends that emerge at large scales, we shouldn’t be surprised to discover that human populations cluster differently. At least, that’s my default assumption. The DH Conference draws more males, but then, an Early Childhood Education conference will draw more females. (I once attended a conference on speech and behavior therapy for autistic children; there were no more than three or four males amid about seventy females.) Or take a look at the National Association for the Education of Young Children. Although we often hear about the male-ness of executive boards, the NAEYC’s executive team is entirely female, and its 17-member governing board boasts 13 females and only 4 males. Looking at all the Early Childhood Education associations and organizations in the country, what gender trends would we expect to find?

The first question to ask about demographic distribution in any particular population (like Google’s workforce or citations in abstracts) is this: What are the characteristics of the larger population from which this particular population is drawing? As long as rhetorical scholars continue to look at rhetorical history, where most of the figures are male, then we can continue to expect many citations in these historical journals to be male. (This may change, however, as more and more rhetorical historians re-discover the history of female oratory.) Or, in Google’s case, if we take the American population as the baseline, assuming a 50/50 gender split, then clearly there is a gender imbalance. But in terms of race and ethnicity, its white workforce is in fact under-represented. Raising the percentage of blacks and Hispanics at Google would mean firing a lot of the Chinese and Indians, unless we want to make whites more under-represented than they already are. (A fairer baseline population would be the percentage of working-age adults in America, or, better yet, the percentage of working-age adults with college degrees; however, those stats are much harder to come by. Total population is a decent but imperfect proxy.)

The point is that we do not always find particular populations boasting a uniform or near-uniform demographic distribution. Why is this? A complex question. Given the totality of the human population (or, more humbly, the totality of any total population in a given geographic area), why do we find the smaller population clusters clustering the way they do around different practices? Why are there more males in CCC citations? Why are there more males at the DH Conference? Why are East Asians and Indians so over-represented at Google? Why are there so few East Asians and Indians in the NFL and the NBA? That populations cluster differently around different practices seems to be a statistical fact. Is it also a future inevitability?

A possible explanation for the emergence of quotative “like” in American English

So Monica was like, “What are you doing here, Chandler?” and Chandler was like, “Uhh nothing” and then Monica was like, “Why are you here with Phoebe?” and Chandler was like, “I don’t know,” and Monica was like, “Whatever!”

Quotative “be like” probably gets on your nerves. Unfortunately for you, it spread like wildfire in the latter half of the 20th century and today is used by native and non-native speakers alike as often as they use traditional say-type quotatives. What is its structure, when did it arise, and why did it spread so quickly? This post offers a possible explanation, based on evidence dragged up from the depths of the Google Books Corpus. To appreciate that evidence, however, we need to start with some discussion of this quotative’s formal properties.

1

One interesting property of quotative “be like” is its ambiguous semantics. In some contexts, it is a stative predicate that denotes internal speech, i.e., thoughts reflexive of an attitude. In other contexts, it is an eventive predicate denoting an actual speech act. Sometimes, the denotation is ambiguous, as in (1):

(1) Monica was like, “Oh my God!”

. . . Did Monica literally say “Oh my God!” or did she just think or feel it?

Another interesting property of quotative “be like” is that it disallows indirect speech.

(2a) Monica was like, “I should go to the mall.”

(2b) *Monica was like that she should go to the mall.

(2c) *Monica was like she should go to the mall

Quotative say of course allows indirect speech:

(3a) Monica said, “I should go to the mall.”

(3b) Monica said that she should go to the mall.

(3c) Monica said she should go to the mall.

Haddican et al. (2012) recognize that quotative “be like” is immune to indirect speech due to its mimetic implicature. (2b) cannot be allowed because quotative “be like” always means something more along these lines:

(4) Monica was like: QUOTE

Given the implied mimesis of this construction, it makes no sense, as in (2b) and (2c), to add an overt complementizer and to change person/tense to produce an indirect, third person report. This property is shared by all uses of quotative “be like,” whether in their stative or eventive readings.

But there’s more to it than a mimetic implicature. Schourup (1982) points out that quotative “go” also shares this mimetic property (although he does not frame it as such). As expected of a quotative with a mimetic implicature, quotative “go” likewise does not allow an indirect speech interpretation via addition of an overt complementizer and shifts in person/tense:

(5a) Monica goes, “I should go to the mall.”

(5b) *Monica goes that she should go to the mall.

Why should these innovative quotatives be so immune to indirect speech and so committed to direct quote marking? Schourup suggests that quotative “go” (and, by extension, quotative “be like”) arose precisely to meet English’s need for a mimetic, unambiguous direct quotation marker. Prior to the occurrence of these new quotatives, English lacked such a marker. Consider (6a) and (6b) below:

(6a) When I talked to him yesterday, Chandler said that you should go to the doctor.

(6b) When I talked to him yesterday, Chandler said you should go to the doctor.

There is no ambiguity in (6a). The speaker of this utterance clearly intends to convey to his interlocutor that Chandler said the interlocutor should go to the doctor. (6b), however, introduces ambiguity. The utterance in (6b) can be interpreted in two ways: a) Chandler said the speaker of the utterance (i.e., I) should go to the doctor; b) Chandler said the speaker’s interlocutor (i.e., you) should go to the doctor. With orthographic conventions, of course, this ambiguity disappears:

(6c) When I talked to him yesterday, Chandler said, “You should go to the doctor.” (So I went.)

However, unlike other languages, spoken English has no “quoting” conventions—it has no direct quote markers for unmarked speech. It is unclear if (6b) is a true quotative or merely an indirect report on speech with a null complementizer.

QuotvsInt

We can imagine speakers needing to clarify this ambiguity:

JOEY: When I talked to him yesterday, Chandler said you should go to the doctor.

ROSS: Wait, he said I should go or you should go?

This ambiguity arises with say-type verbs whenever the complementizer “that” is omitted. It is traditionally understood that English differentiates between direct quotatives and indirectly reported speech via shifts in person and/or tense. However, the overt complementizer is really the central feature of this differentiation. Without an overt complementizer, it is never entirely clear if the embedded clause is a direct quote or an indirect report of speech. Here’s another example:

(7) JOEY: Chandler said I will be responsible for the cat’s funeral.

Without the aid of quote marks, we cannot know whether Chandler or Joey is responsible for the cat’s funeral, even though the embedded clause contains a shift in both person and tense. Of course, if Joey wants to convey that Joey himself will be responsible for the cat’s funeral, he can simply add the overt complementizer: “Chandler said that I will be responsible . . .” However, if Joey wants to convey that Chandler has decided to be responsible, Joey has no way to convey it unambiguously with say-type verbs. He must resort to an indirect speech construction with an overt complementizer. Alternatively, he can resort to non-structural signals: a short pause, a change in intonation, or a mimicry of Chandler’s voice. Or he must abandon say-type constructions altogether and convey his meaning some other way.

Quotative “go” and quotative “be like” solve this ambiguity. These innovative quotatives always signal that the following clause is mimetic, a direct quote of speech or thought. Many languages (Russian, Japanese, Georgian, Ancient Greek, to name just a few) have overt markers to ensure that interior clauses are understood as being directly quoted material, whether or not those quoted clauses contain grammatical shifts (though of course they often do). The quotatives “go” and “be like” serve this same purpose. They are structural, unambiguous markers for direct speech, which is why one cannot use them for indirect speech, and which is also why they have spread so widely and quickly: they have met a real need in the language.

Quotative “go,” however, is attested long before quotative “be like.” The Oxford English Dictionary puts the earliest usage in the early 19th century, initially as a way to mime sounds people made, then later as a way to report on actual speech. Here’s an example from Dickens’ Pickwick Papers:

DickensPickwick

So, although I have said that both quotative “be like” and quotative “go” met a need in English for an unambiguous direct quotation marker, it was “go” that in fact met the need first, by at least a century. This historical fact leads me to suspect that quotative “be like” met a slightly different need: while quotative “go” became a direct quotation marker for speech acts, quotative “be like” became a direct quotation marker for thoughts. As Haddican et al. rightly note, an innovative feature of these quotatives is that they allow direct quotes to be descriptors of states. In other words, the directly marked quotes of “go” denote external speech; the directly marked quotes of “be like” primarily denote internal speech, i.e., thoughts or attitudes. I believe this hypothesis is supported by the earliest uses of quotative “be like,” to which we now turn:

2

Today, young native and non-native speakers of English frequently use “like” as a versatile discourse marker or interjection in addition to its use as a quotative (D’Arcy 2005). D’Arcy provides two extreme examples of discourse marker “like.” Both are taken from a large corpus of spoken English:

(8) I love Carrie. LIKE, Carrie’s LIKE a little LIKE out-of-it but LIKE she’s the funniest, LIKE she’s a space-cadet. Anyways, so she’s LIKE taking shots, she’s LIKE talking away to me, and she’s LIKE, “What’s wrong with you?”

(9) Well you just cut out LIKE a girl figure and a boy figure and then you’d cut out LIKE a dress or a skirt or a coat, and LIKE you’d colour it.

This usage does not become noticeable in available corpora until the 1980s, so nearly all papers that I have read assume that discourse marker “like” and quotative “be like” arose more or less in tandem during the 1970s, becoming common by the 1980s. However, using the Google Books Corpus, I was able to find an early use of “like” that presages quotative “be like.” This early use also seems to set the stage for the versatile discursive uses of “like” seen in (8) and (9). This early use is the expression, “like wow.” It seems to have arisen during the 1950s (though perhaps earlier) in the early rock ’n’ roll scenes in the Southern United States. Here are some examples.

The first is from 1957: a line from a rock ’n’ roll song by Tommy Sands:

(10) When you walk down the street, my heart skips a beat—man, like wow!

The second is from a 1960 issue of Business Education World:

(11) Like, wow! I’m taking a real cool course called general business. It’s the most.

BusinessEducationWorld

The third is from a novel called The Fugitive Pigeon, published in 1965:

(12) But all of a sudden you’re like wow, you know what I mean?

And by 1971, we have a full example of quotative “be like”; note that this early occurrence uses an expletive as the subject:

(13) But to me it was like, “Oh, why can’t you say, ‘Gee that’s wonderful . . .’”

LifeMagazine1971

These early uses of “like wow” in (10) and (11) denote a stative feeling or attitude rather than any kind of eventive speech act. This is especially clear in (11), where the expression is a direct response to a question about how the speaker is feeling. The quotative in (13) likewise seems to be a stative predicate rather than an eventive one. In fact, in nearly all of the earliest uses of quotative “be like” (from the 1970s and early 1980s, as reported in the Google Books Corpus), the intention is to denote a feeling or attitude, not a direct quote of a speech act. Such eventive predications don’t become common until the 1990s and 2000s.

“Like wow,” then, arose in 1950s slang as a stative description. However, the sentence in (14) below suggests that wow was not interpreted as a structurally independent interjection but as an adjective. This is from a 1960 edition of Road and Track magazine:

(14) Man, that crate would look like wow with a Merc grille.

RoadTrack

It is possible that like is an adverb here, but in my estimation it is most likely still a garden variety manner preposition that has innovatively selected for a bare adjective. Typically, like as a preposition only selects NPs as its complement. However, with the advent of “like wow,” it loosened its selection requirements and began to select for adjectives as well. And not just adjectives. The bottom line in this advertisement from Billboard magazine in May 1960 demonstrates that it also began to select for adverbs:

BillboardLikeWowAd

Apparently, in the 1950s and early 1960s, like became a popular and versatile manner preposition. Once like loosened its requirements to select AP complements, it’s easy to see how it could start selecting quotes, thus becoming a new direct quote marker (like narrative “go”); and given the stative denotation of the original phrase “like wow,” it’s also easy to see why stative to be would become the verbal element in this quotative rather than a lexical verb like act or go. Indeed, it appears that the first uses of quotative “be like” were entirely restricted to the phrase “like wow,” ensuring that subsequent uses would likewise have stative readings. (The ad above also shows how easy it would be for like to become an all-around discourse marker once it began to select for a wider range of phrases.)

So, based on the timeline of evidence in the corpus, I posit the following evolution:

LikeEvolution

The emergence of quotative “like”

I follow Haddican et al. in assuming that like in quotative “be like” is still a manner preposition. However, while they assume the preposition did not undergo any change, I argue that like became more versatile in its selection restrictions. This versatility allowed it first to select APs, then to select quotes. Initially, this quotative construction was just an extension of the phrase “like wow,” but it soon began to select any quoted material. And from the beginning, this quotative possessed two features: a) it had an obvious mimetic implicature, ensuring that it would be a direct quote marker, similar to narrative “go”; and b) it had a stative denotation, due to the stative denotation of the original phrase “like wow,” ensuring that the directly marked quotes were reflective of internal speech, i.e., thoughts or attitudes.

A corpus analysis by Buchstaller (2001) has shown that, even today, quotative “go” is much more likely than quotative “be like” to frame “real, occurring speech” (p. 10); in other words, “be like” continues to be used more often as a stative rather than eventive predicate. As I mentioned earlier, Haddican et al. are correct that one innovative aspect of quotative “be like” is that quotes are now able to be descriptors of states; however, I believe they overstate the eventive vs. stative ambiguity that arises in these quotatives. Most of the time, in real contexts, they are as unambiguously stative as they are unambiguously mimetic of the state. Haddican et al. themselves note that even these eventive readings are open to clarification. Asking whether or not someone “literally” said something sounds much odder following a say-type quotative than a “be like” quotative with a putatively eventive reading.

3

Nevertheless, as I showed at the very beginning of this post, there are instances where quotative “be like” seems to denote an eventive speech act. Linguistically, this is odder than it sounds at first. A single verbal construction—like quotative “be like”—should not have a stative and eventive reading. This ambiguity can only happen for two reasons: either there is some special semantic function at work in this construction, or there are in fact two separate quotative constructions, each with its own syntactic structures.

It is tempting to see a correlation between this ambiguity and the putative ambiguity between stative be and eventive be, also known as the be of activity. Consider the following sentences:

(15) Joey was silly.

(16) Rachel asked Joey to be silly.

Both forms of be select an adjective; however, (16), unlike (15), can be taken to mean that Joey performed some silly action. In other words, the small clause in (16) seems to be an eventive predication, not a stative one. It has been argued (Parsons 1990) that this eventive be is not the usual copular form but a completely different verb that means something like “to act”—in other words, English to be is actually a homophonous pair of verbs, similar to auxiliary have and possessive have. Perhaps this lexical ambiguity in be is related to the eventive vs. stative ambiguity in quotative “be like.” The stative reading arises when stative be is involved; the eventive reading arises when the eventive, lexical be is involved.

Haddican et al. argue against this line of thought. Diachronically, we know that quotative “be like” has arisen rapidly in many varieties of English, and that in all of these varieties, the semantics are ambiguous. But if there are in fact two be verbs that underwent this quotative innovation, then we would need to posit two unrelated channels of change: one in which like+QUOTE became a possible complement of stative be and one in which like+QUOTE became a possible complement of eventive be.

This is actually a problematic claim, given that, presumably, stative and eventive be have different structures. The former undergoes its typical V to T movement in English; the latter, given its eventive semantics, would be expected to remain in the VP like any other lexical verb. These underlying structures would demand that we devise different processes by which quotative “be like” arose. However, given the rapidity with which it did in fact arise, it is more probable that it arose via a single process—and the inevitable conclusion is that there is a single, stative verb to be that underwent the process. This conclusion is also supported by the auxiliary-like behavior of be in quotatives involving adverbs and questions:

(17) Ross was totally like, “I don’t care!”

(18) Was Ross like, “I don’t care”?

Although the ambiguous stative vs. eventive reading still occurs here, (17) exhibits raising above AdvP, and (18) exhibits subject-aux inversion. In other words, be in these quotatives behaves like an ordinary copular auxiliary, not a lexical verb. We therefore should not posit a separate, eventive be verb. We need another way to explain the semantic ambiguity of these quotatives.

Haddican et al. explain this ambiguity with Davidsonian semantics. Briefly stated, they argue that there is a single stative be verb—both in these quotative constructions and in English more generally. However, be has a semantic LOCALE function that, in certain contexts, can localize the state in a short-term event, and this localization of an event can force an agentive role onto the subject, even when an adjective has been selected by be. So, in a sentence such as (19), be will have a denotation as in (20):

(19) Joey is being silly.

(20) [[be]] = λSλeλx. ∃s ϵS [e = LOCALE(s) & ARGUMENT(x,e)]

(20) takes a property of state S and localizes it into an event (a moment in which Joey was silly); in the right context, it is not a great leap to coerce this experiencer event into an agentive one. The application of these semantics to “(be) like” quotatives is straightforward:

In the state reading, be like is simply a stage level use of the copula, localised to the event in which the subject of be exhibited the relevant behaviour. The eventive reading arises when the event mapped to is an agentive one, where the most plausible event of an agent behaving in a quotative manner is the relevant speech act. (Haddican et al. 2012, p. 85)

In short, the ambiguity between stative and eventive “be like” arises from a semantic property that forces certain “states of being” to be processed as localized events whereby the experiencer of the event takes on an agentive role. In certain quotative contexts, the embedded quote is processed as an event, and the subject is understood as having caused that event, i.e., as actually saying something rather than just experiencing an attitude.

I agree that it would be better not to posit two homophonous verbs (stative be vs. be of activity) to account for the ambiguous stative vs. eventive denotations of quotative “be like.” Doing so requires two separate analyses and two separate channels of diffusion, which seems unlikely given the rapidity with which this quotative did in fact spread across many varieties of English. However, Haddican et al.’s application of Davidsonian semantics to explain the ambiguous readings runs into a problem in sentences like (21) below, as well as in the earlier example in (13):

(21) It was like, “Oh Mom, Can I film a movie in the house, it won’t be any problem at all.”

This is clearly an eventive predication of quotative “be like.” But instead of an agentive subject we have expletive it. Recall that Haddican et al.’s analysis relies on the notion that stative be has a LOCALE function that locates the state in a temporary moment or event. This localization can coerce an experiencer subject into the role of an agentive subject when the most likely reading (as above) suggests that the temporary event was an actual speech act. As Haddican et al. say themselves, “this event assigns an agentive role to the subject” (p. 85). However, by definition, the expletive in (21) receives no theta role and can therefore be neither the experiencer of a state nor the agent of an event. And yet (21) clearly denotes an eventive reading: the speaker actually spoke the words, or something like them.

The fact that “be like” quotatives can take an eventive (or even a stative) reading when an expletive surfaces in spec-TP suggests that Davidsonian semantics do not explain the ambiguous eventive vs. stative readings associated with these quotatives. (The fact that “be like” quotatives exhibit both experiencer subjects and expletive subjects also suggests that the quote CP is the only obligatory argument assigned by “be like.”)

The only alternative seems to be that there are in fact two homophonous be verbs, and quotative “be like” makes use of both. Maybe this isn’t such a big deal. If I’m right about the diachronic process by which quotative “be like” arose, then we can at least see a two-step process: quotative “be like” was solely a stative predicate in its early use and for most of its early history; only later did it begin to be used as an eventive predicate. And if there are in fact two be verbs, the eventive sounds exactly like the stative and is in fact much rarer than the stative, so I suppose one can see how these facts laid the groundwork for the eventual use of stative “be like” as an eventive predicate.

Distant Reading and the “Evolution” Metaphor

1

Are there any corpora that purposefully avoid “diachronicity”? There are corpora that possess no meta-data about publication dates and whose texts are therefore organized by some other scheme—for example, the IMDB movie review corpus, which is organized according to positive/negative polarity; its texts, as far as I know, are not arranged chronologically or coded for time in any way. And there are cases where time-related data are not available, easily or at all. But have any corpora been compiled with dates—the time element—purposefully elided? Is time ever left out of a corpus because that information might be considered “noise” to researchers?

Maybe in rare situations. But for most corpora whose texts span any length of time greater than a year, the texts are, if possible, arranged chronologically or somehow tagged with date information. In this universe, time flows in one direction, so assembling hundreds or thousands of texts with meta-data related to their dates of publication means the resulting corpus will possess an inherent diachronicity whether we want it to or not. We can re-arrange the corpus for machine-learning purposes, but the “time stamp” is always there, ready to be explored. Who wouldn’t want to explore it?

If we have a lot of texts—any data, really—that span a great length of time, and if we look at features in those data across the time span, what do we end up studying? In nearly all cases, we end up studying patterns of formal change and transformation across spans of time. The “evolution” metaphor suggests itself immediately. Be honest, now, you were thinking about it the minute you compiled the corpus.

One can, of course, use “evolution” as a general synonym for change. This is probably the case for Thomas Miller’s The Evolution of College English and for many other studies whose data extend only to a limited number of representative sources. However, when it comes to distant readings, the word becomes much more tempting. The trees of Moretti’s Graphs, Maps, Trees are explicitly evolutionary:

For Darwin, ‘divergence of character’ interacts throughout history with ‘natural selection and extinction': as variations grow apart from each other, selection intervenes, allowing only a few to survive. In a seminar a few years ago, I addressed the analogous problem of literary survival, using as a test case the early stages of British detective fiction . . . (70-71)

The same book ends with an afterword by geneticist Alberto Piazza (who worked with Luigi Luca Cavalli-Sforza on The History and Geography of Human Genes). Piazza writes:

[Moretti’s writings] struck me by their ambition to tell the ‘stories’ of literary structures, or the evolution over time and space of cultural traits considered not in their singularity, but their complexity. An evolution, in other words, ‘viewed from afar’, analogous at least in certain respects to that which I have taught and practiced in my study of genetics. (95)

Analogous at least in certain respects . . . For Moretti and Piazza, literary evolution is not just a synonym for change in literature. Biological evolution becomes a guiding metaphor (not perfect, by any means) for the processes of formal change analyzed by Moretti. Piazza continues:

The student of biological evolution is especially interested in the root of a [phylogenetic] tree (the time it originated). . . . The student of literary evolution, on the other hand, is interested not so much in the root of the tree (because it is situated in a known historical epoch) as in its trajectory, or metamorphoses. This is an interest much closer to the study of the evolution of a gene, the particular nature of whose mutations, and the filter operated by natural selection, one wants to understand . . . (112-113)

Obviously, for Piazza, Moretti’s study of changes to and migrations of literary form in time and space evokes the processes and mechanisms of biological evolution—there’s not a one-to-one correspondence, of course, and Piazza points this out at length, but the similarities are evocative enough that he, a population geneticist, felt confident publishing his thoughts on the subject.

In Distant Reading, Moretti has more recently acknowledged that the intense data collection and quantitative analysis that has marked work at Stanford’s Literary Lab must at some point heed “the need for a theoretical framework” (122). Regarding that framework, he writes:

The results of the [quantitative] exploration are finally beginning to settle, and the un-theoretical interlude is ending; in fact, a desire for a general theory of the new literary archive is slowly emerging in the world of digital humanities. It is on this new empirical terrain that the next encounter of evolutionary theory and historical materialism is likely to take place. (122)

In Macroanalysis, Matthew Jockers also acknowledges (and resists) the temptation to initiate an encounter between evolutionary theory and the quantitative, diachronic data compiled in his book:

. . . the presence of recurring themes and recurring habits of style inevitably leads us to ask the more difficult questions about influence and about whether these are links in a systematic chain or just arbitrary, coincidental anomalies in a disorganized and chaotic world of authorial creativity, intertextuality, and bidirectional dialogics . . .

“Evolution” leaps to mind as a possible explanation. Information and ideas do behave in ways that seem evolutionary. Nevertheless, I prefer to avoid the word evolution: books are not organisms; they do not breed. The metaphor for this process breaks down quickly, and so I do better to insert myself into the safer, though perhaps more complex, tradition of literary “influence” . . . (155)

And in the last chapter to Why Literary Periods Mattered, Ted Underwood does not mention evolution at all but there is clearly an evolutionary connotation to the terms he uses to describe digital humanities’ influence on literary scholars’ conception of history:

. . . digital and quantitative methods are a valuable addition to literary study . . . because their ability to represent gradual, macroscopic change brings a healthy theoretical diversity to literary historicism . . .

. . . we need to let quantitative methods do what they do best: map broad patterns and trace gradients of change. (159, 170)

Underwood also discusses “trac[ing] processes of change” (160) and “causal continuity” (161). The entire thrust of Underwood’s argument, in fact, is that distant or quantitative readings of literature will force scholars to stop reading literary history as a series of discrete periods or sharp cultural “turns” and to view it instead as a process of gradual change in response to extra-literary forces—“Romanticism” didn’t just become “Naturalism” any more than Homo erectus one decade decided to become Homo sapiens.

Tracing processes of gradual, macroscopic change . . . if that doesn’t invoke evolutionary theory, I don’t know what does. Underwood doesn’t even need to use the word.

Moretti, Jockers, and Underwood are three big names in digital humanities who have recognized, either explicitly or implicitly, that distant reading puts us face to face with cultural transformation on a large, diachronic scale. Anyone working with DH methods has likely recognized the same thing. Like I said, be honest: you were already thinking about this before you learned to topic model or use the NLTK.

 

2

Human culture changes—its artifacts, its forms. This is not up for debate. Even if we think human history is a series of variations on a theme, the mutability of cultural form remains undeniable, even more undeniable than the mutability of biological form. Distant reading, done cautiously, gives us a macro-scale, quantitative view of that change, a view simply not possible to achieve at the scale of individual texts or artifacts. Given the fact of cultural transformation, then, and DH’s potential to visualize it, to quantify aspects of it, one of two positions must be taken.

1. The diachronic patterns we discover in our distant readings are, to use Jockers’ words, “just arbitrary, coincidental anomalies in a disorganized and chaotic world of authorial creativity, intertextuality, and bidirectional dialogics.” Theorizing the patterns is a fool’s errand.

2. The diachronic patterns we discover are not arbitrary or random. Theorizing the patterns is a worthwhile activity.

Either we believe that there are processes guiding cultural change (or, at least, that it’s worthwhile to discover whether or not there are such processes) or we assume a priori that no such processes exist. (A third position, I suppose, is to believe that such processes exist but we can never know them because they are too complex.) We can all decide differently. But those who adopt the first position should kindly leave the others to their work. In my view, certain criticisms of distant reading amount to an admonition that “What you’re trying to do just can’t be done.” We’ll see.

 

3

When we decide to theorize data from distant readings, what are we theorizing? Moretti, Jockers, and Underwood each provide a similar answer: we are theorizing changes to a cultural form over time and, in some instances, space. Certain questions present themselves immediately: Are the changes novel and divergent, or are they repeating and reticulating? Is the change continuous and gradual, or are there moments of punctuated equilibrium? How do we determine causation? Are purely internal mechanisms at work, or also external dynamics? A complex interplay of both internal mechanisms and external dynamics? How do we reduce data further or add layers of them to untangle the vectors of causation?

To me, all of this sounds purely evolutionary. Even talking about gradual vs. quick change is a discussion taken right out of Darwinian theory.

But we needn’t adopt the metaphor explicitly if we are troubled that it breaks down at certain points. Alex Reid writes:

Matthew Jockers remarks following his own digital-humanistic investigation, “Evolution is the word I am drawn to, and it is a word that I must ultimately eschew. Although my little corpus appears to behave in an evolutionary manner, surely it cannot be as flawlessly rule bound and elegant as evolution” (171). As he notes elsewhere, evolution is a limited metaphor for literary production because “books are not organisms; they do not breed.” He turns instead to the more familiar concept of “influence” . . . Certainly there is no reason to expect that books would “breed” in the same way biological organisms do (even though those organisms reproduce via a rich variety of means). [However], if literary production were imagined to be undertaken through a network of compositional and cognitive agents, then such productions would not be limited to the capacity of a human to be influenced. Jockers may be right that “evolution” is not the most felicitous term, primarily because of its connection to biological reproduction, but an evolutionary-type process, a process as “natural” as it is “cultural,” as “nonhuman” as it is “human,” may exist.

An “evolutionary-type” process of culture is what we’re after, one that is not necessarily reliant on human agency alone. Will it end up being “flawlessly rule bound and elegant as evolution”? First, I think Jockers seriously over-estimates the “flawless” nature of evolutionary theory and population genetics. If the theory of evolution is so flawless and elegant, and all the science settled, what do biologists and geneticists do all day? Here’s a recent statement from the NSF:

Understanding the tree of life has been a goal of evolutionary biologists since the time of Darwin. During the past decade, unprecedented gains in gathering and analyzing phylogenetic data have demonstrated increasingly complex genealogical patterns.

. . . . Our current knowledge of processes such as hybridization, endosymbiosis and lateral gene transfer makes clear that the evolutionary history of life on Earth cannot accurately be depicted as a single, typological, bifurcating tree.

Moretti, it turns out, needn’t worry so much about the fact that cultural evolution reticulates. And Jockers needn’t assume that biological evolution is elegantly settled stuff.

Secondly, as Reid argues, we needn’t hope to discover a system of influence and cultural change that can be reduced to equations. We probably won’t find any such thing. However, within all the textual data, we can optimistically hope to find regularities, patterns that can be used to make predictions about what might be found elsewhere, patterns that might connect without casuistic contrivance to theories from the sciences. Here’s an example, one I’ve used several times on this blog: Derek Mueller’s distant reading of the journal College Composition and Communication. Mueller used article citations as his object of analysis. When he counted and graphed a quarter century of citations in the journal, he discovered patterns that looked like this:

[Figure: Mueller’s long-tail graph of citation counts in College Composition and Communication]

Actually, based on similar studies of academic citation patterns, we could have predicted that Mueller would discover this power law distribution. It turns out that academic citations—a purely cultural form, a textual artifact constructed through the practices of the academy—behave according to a statistical law that seems to affect all sorts of things, from earthquakes to word frequencies. This example makes a strong case against those who argue that cultural artifacts, constructed by human agents within their contextualized interactions, will not aggregate over time into scientifically recognizable patterns.  Granted, this example comes from mathematics, not evolutionary theory, but it makes the point nicely anyway: the creations of human culture are not necessarily free from non-human processes. Is it foolish to look for the effects of these processes through distant reading?
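For readers who want to see what a power law looks like in practice, here is a minimal sketch in Python. It is not Mueller's procedure: the counts are invented (an idealized tail in which each count is 120 divided by its rank), and the function name `loglog_slope` is my own. The idea is to rank the counts from largest to smallest and fit a straight line to log(count) against log(rank); a roughly straight line with a negative slope is the signature of a long tail.

```python
import math

def loglog_slope(counts):
    """Least-squares slope of log(count) on log(rank), with ranks starting at 1."""
    ranked = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for count in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Hypothetical citation counts, invented for illustration: count = 120 / rank.
toy_counts = [120 / rank for rank in range(1, 51)]
slope = loglog_slope(toy_counts)
print(round(slope, 2))
```

On this idealized input the slope comes out at -1. Real citation data will be noisier, and a log-log fit of this kind is only a rough diagnostic, not a formal test for a power law.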

 

4

“Evolution,” “influence,” “gradualism”—whatever we call it in the digital humanities, those of us adopting it on the literary and rhetorical end have a huge advantage over those working in history: we have a well-defined, observable element, an analogue of DNA, to which we can always reduce our objects of study: words. If evolution is going to be a guiding metaphor, we need this observable element because it is through observations of its metamorphoses (in usage, frequency, etc.) that we begin to figure out the mechanisms and dynamics that actually cause or influence those metamorphoses. If we had no well-defined segment to observe and quantify, the evolutionary metaphor could be thrown right out.

To demonstrate its importance, allow me a rhetorical demonstration. First, I’ll write out Piazza’s description of biological evolution found in his afterword to Graphs, Maps, Trees. Then, I’ll reproduce the passage, substituting lexical and rhetorical terms for “genes” but leaving everything else more or less the same. Let’s see how it turns out:

Recognizing the role biological variability plays in the reconstruction of the memory of our (biological) past requires ways to visualize and elaborate data at our disposal on a geographical basis. To this end, let us consider a gene (a segment of DNA possessed of a specific, ascertainable biological function); and for each gene let us analyze its identifiable variants, or alleles. The percentage of individuals who carry a given allele may vary (very widely) from one geographical locality to another. If we can verify the presence or absence of that allele in a sufficient number of individuals living in a circumscribed and uniform geographical area, we can draw maps whose isolines will join all the points with the same proportion of alleles.

The geographical distribution of such genetic frequencies can yield indications and instruments of measurement of the greatest interest for the study of the evolutionary mechanisms that generate genetic differences between human populations. But their interpretation involves quite complex problems. When two human populations are genetically similar, the resemblance may be the result of a common historical origin, but it can also be due to their settlement in similar physical (for example, climatic) environments. Nor should we forget that styles of life and cultural attitudes of an analogous nature (for example, dietary regimes) can favour the increase or decrease to the point of extinction of certain genes.

Why do genes (and hence their frequencies) vary over time and space? They do so because the DNA sequences of which they are composed can change by accident. Such change, or mutations, occurs very rarely, and when it happens, it persists equally rarely in a given population in the long run . . . From an evolutionary point of view, the mechanism of mutation is very important because it introduces innovations . . .

. . . The evolutionary mechanism capable of changing the genetic structure of a population most swiftly is natural selection, which favours the genetic types best adapted for survival to sexual maturity, or with a higher fertility. Natural selection, whose action is continuous over time, having to eliminate mutations that are injurious in a given habitat, is the mechanism that adapts a population to the environment that surrounds it. (100-101)

Now for the “distant reading” version:

Recognizing the role lexical variability plays in the reconstruction of the memory of our (literary and rhetorical) past requires ways to visualize and elaborate data at our disposal on the basis of cultural space (which often correlates with geography). To this end, let us consider a word (a segment of phonemes and morphemes possessed of a specific, ascertainable grammatical or semantic function); and for each word let us analyze its stylistic variants, or synonyms. The percentage of texts that carry a given stylistic variant may vary from one cultural space to another, or from one genre to the other. If we can verify the presence or absence of that variant in a sufficient number of texts produced in a circumscribed and uniform cultural space we can draw maps whose isolines will join all the points with the same proportion of stylistic variants.

The distribution of such lexical frequencies can yield indications and instruments of measurement of the greatest interest for the study of the evolutionary mechanisms that generate lexical differences between “generic populations.” But their interpretation involves quite complex problems. When two rhetorical forms or genres are lexically similar, the resemblance may be the result of a common historical origin, but it can also be due to their development in similar geographic or political environments. Nor should we forget that styles of life and cultural attitudes of an analogous nature (for example, religious dictates) can favour the increase or decrease to the point of extinction of certain lexical items or clusters of lexical items.

Why do words (and hence their frequencies and “clusterings”) vary over time and space? They do so because of stylistic innovations. Such innovation occurs very rarely, and when it happens, it persists equally rarely in a given generic population in the long run . . . From an evolutionary point of view, the mechanism of innovation is very important because it introduces new rhetorical forms . . .

. . . The evolutionary mechanism capable of changing the lexical structure of a rhetorical form or genre most swiftly is cultural selection, which favours the forms best adapted for survival to publication and circulation, or with a higher degree of influence (meaning a higher likelihood of being reproduced by others without too many changes). Cultural selection, whose action is continuous over time, having to eliminate rhetorical innovations or “mutations” that are injurious in a given cultural habitat, is the mechanism that adapts a rhetorical form to the environment that surrounds it.

Obviously, it’s not perfect. I leave it to the reader to decide its persuasive potential.

I think the biggest problem is in the handling of mutations. In biological evolution, genes mutate via chance variations during replication of their segments; these mutations can introduce innovations in an organism’s form or function. In literary evolution, however, no sharp distinction exists between a lower-scale “mutation” and the innovation it introduces. The innovation is the formal mutation. This issue arises because, in literary evolution, as in linguistic evolution, the genotype/phenotype distinction is not as obvious or strictly scaled as it is in evolutionary theory. Words are more phenotype than genotype, unless we want to get lost in an overly complex evocation of morphology and phonology.

The metaphor always breaks down somewhere, but where it works, it is, I think, highly suggestive: the idea is that we track rhetorical forms—constellations of words and their stylistic variants—across time and space, in order to see where the forms replicate and where they disappear. Attach meta-data to the texts that constitute those forms, and we will have what it takes to begin making data-driven arguments about how cultural ecology affects or does not affect cultural form.

It’s an interesting framework in which distant reading might go forward, even if explicit uses of the word “evolution” are abandoned.

Graphing Citations and Making Sense of Disciplinary Divisions

A Pareto distribution: that is the troubling result of Derek Mueller’s distant reading of citations in College Composition and Communication. He found a “long tail” of citations, with a handful of names cited many times but vastly more names cited only once. Out of 8,035 unique citations, 5,761 were cited once and 986 were cited twice. In other words, 84% of unique citations in CCC occurred only once or twice in a 25-year period.
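The 84% figure follows directly from the three counts above. A quick check in Python, using only those reported numbers (the variable names are mine):

```python
# Counts reported above from Mueller's study of citations in CCC.
unique_citations = 8035
cited_once = 5761
cited_twice = 986

# Share of unique citations that appear only once or twice.
share_once_or_twice = (cited_once + cited_twice) / unique_citations
# Share that appear exactly once.
share_once = cited_once / unique_citations

print(f"once or twice: {share_once_or_twice:.1%}")
print(f"once only:     {share_once:.1%}")
```

So about 84% of unique citations appear once or twice, and roughly 72% appear exactly once.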

Troubling, but unsurprising. Physical and social scientists have long known that power law distributions occur across a wide variety of phenomena, including academic citations (Gupta et al. 2005). That a long tail occurs in a rhet/comp journal simply puts our discipline in the same position as everyone else: a small group of scholarly work has gained a “cumulative advantage” or “preferential attachment” and thus become the core set of classic texts recognized by the field, while most other scholars fail to produce texts that cross the tipping point toward their own preferential attachment. It is usually assumed that this core group of scholars is what unites a discipline. To some extent, the assumption is probably true. However, Mueller is right to ask how far a citation trail can lead away from that core group of scholars before we start questioning just how unified a discipline really is.

When graphing citation counts, it’s not problematic to discover a steep drop between the most cited scholar and the tenth most cited scholar; nor is it problematic that most sources are cited infrequently. The problem is not the long tail. The problem, in CCC’s case, is that the long tail very rapidly approaches a value equal to one. This indicates that any given source in CCC is valuable to the scholar citing it but effectively worthless to everybody else who publishes in the journal. If most citations occurred three, four, five times, even that would suggest a certain unity of purpose—what one scholar has found valuable, several others have found valuable as well, in various issues and various contexts. But when the long tail is mostly comprised of sources cited once and never again? That requires a more robust explanation than a nod toward a core group of scholars can provide. Mueller thus raises the right question:

Although we do not at this time have data from all of the major journals to investigate this fully, the changing shape of the graphed distribution reiterates more emphatically a question only hinted at . . . but one nevertheless crucial to the idea of a common disciplinary domain: How flat can the citation distribution become before it is no longer plausible to speak of a discipline?

To answer Mueller’s call for more data, I have compiled article abstracts from CCC and two other major journals in the field—Rhetoric Society Quarterly and Rhetoric Review. I intend this post to serve as a tentative response to the question posed by Mueller at the end of this quote.  The CCC abstracts run from February 2000 (51.3) to September 2011 (63.1), a total of 261 abstracts. The RSQ abstracts run from Winter 2000 (30.1) to Fall 2011 (41.5), a total of 220 abstracts. The RR abstracts run from 2002 (21.3) to 2011 (30.4), a total of 154 abstracts.

Only abstracts, not full articles. However, because only the most important citations appear in abstracts, I think tallying abstract citations offers the best chance to shorten the long tail and partially alleviate the implications of Mueller’s work. It is not a slight to the humanities to point out that articles demand more citations than their arguments actually require: many article citations can be removed without affecting anything vital to an argument. Citations in abstracts, on the other hand, are in most cases central to the argument or study undertaken. If we count only the most important sources in each journal—the ones that surface in abstracts—is the long tail of citation distributions less pronounced? We can expect to discover a long tail. That’s a mathematical inevitability. But if a journal—to say nothing of an entire discipline—is somehow unified, citations in abstracts should have a slightly less extreme power law distribution than citations in the articles themselves. Abstract citations are the “cream of the crop,” those vital enough to make it into the space constraints of the abstract genre: we hope to find fewer citations and therefore a graph that does not drop so precipitously toward x=1.

Methods: Each corpus was loaded into the Natural Language Toolkit and tagged for part of speech. Then I compiled the proper nouns. The resulting list included personal names but was larger than the set of names alone, so I extracted the names, in both noun forms (e.g. ‘Burke’ or ‘Burke’s’) and adjective forms (e.g. ‘Burkean’), and tracked them across the abstracts. I recorded each unique citation as well as the number of abstracts in which it was cited.
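The tallying step can be sketched roughly as follows. This is a minimal illustration, not the script used for this post. It assumes each abstract has already been tokenized and tagged, arriving as a list of (word, tag) pairs with Penn Treebank tags, the shape that NLTK's `pos_tag` returns. The names `ADJECTIVE_FORMS`, `normalize`, and `count_citations`, along with the two sample abstracts, are hypothetical.

```python
from collections import Counter

# Hypothetical mapping from adjective forms to the cited name.
ADJECTIVE_FORMS = {"Burkean": "Burke"}


def normalize(word):
    """Fold possessive and adjective forms into one name (Burke's -> Burke)."""
    for suffix in ("'s", "’s"):
        if word.endswith(suffix):
            word = word[: -len(suffix)]
    return ADJECTIVE_FORMS.get(word, word)


def count_citations(tagged_abstracts, known_names):
    """Count, for each known name, the number of abstracts that mention it."""
    counts = Counter()
    for tagged in tagged_abstracts:
        # Candidate tokens: proper nouns (NNP) plus any listed adjective form.
        candidates = {
            normalize(word)
            for word, tag in tagged
            if tag == "NNP" or word in ADJECTIVE_FORMS
        }
        # Using a set means a name counts once per abstract,
        # however often it recurs inside that abstract.
        counts.update(candidates & known_names)
    return counts


# Two made-up tagged abstracts, for illustration only.
abstracts = [
    [("Burke", "NNP"), ("'s", "POS"), ("pentad", "NN"),
     ("and", "CC"), ("Burkean", "JJ"), ("identification", "NN")],
    [("Plato", "NNP"), ("and", "CC"), ("Burke", "NNP"),
     ("on", "IN"), ("Athens", "NNP")],
]
counts = count_citations(abstracts, known_names={"Burke", "Plato"})
```

Filtering against `known_names` stands in for the manual step of separating personal names from the larger proper noun list, which is why a place name like "Athens" drops out of the tally.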

Finding citation names

Here are spreadsheets with the unique citations and their citation counts in each abstracts corpus: College Composition and Communication. Rhetoric Society Quarterly. Rhetoric Review.

There are 79 unique citations in the CCC abstracts, 159 in the RSQ abstracts, and 121 in the RR abstracts. Only six citations occur in both the RSQ and CCC abstracts corpora: Mina Shaughnessy, Kenneth Burke, John Dewey, Donald Davidson, Peter Elbow, and Mikhail Bakhtin. When factoring in RR, only Kenneth Burke, John Dewey, and Peter Elbow are shared across all three corpora. RR and RSQ share quite a few sources, almost all of which are historical figures: Plato, Aristotle, Cicero, Isocrates, and the like. Kenneth Burke is the most frequently cited source in each abstracts corpus: he is cited in 5 separate abstracts in CCC, 17 in RSQ, and 14 in RR. Maybe “rhetoric and composition” should be changed to “Burkean studies.” No surprise: the man has his own journal.
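Finding the overlap between corpora is just set intersection. The sketch below uses toy subsets built only from the names mentioned in this post, not the full citation lists in the spreadsheets, and the variable names are my own.

```python
# Toy subsets built only from names mentioned in this post,
# not the full citation lists in the spreadsheets.
ccc = {"Shaughnessy", "Burke", "Dewey", "Davidson", "Elbow", "Bakhtin"}
rsq = ccc | {"Plato", "Aristotle", "Cicero", "Isocrates"}
rr = {"Burke", "Dewey", "Elbow", "Plato", "Aristotle", "Cicero", "Isocrates"}

shared_ccc_rsq = ccc & rsq              # cited in both CCC and RSQ
shared_all = ccc & rsq & rr             # cited in all three journals
shared_rr_rsq_only = (rr & rsq) - ccc   # shared by RR and RSQ but absent from CCC

print(sorted(shared_all))
```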

Based on the raw count of unique citations in each journal (on average, fewer than one per abstract), I think my original suggestion is at least partially correct: counting citations in abstracts controls for the rhetorical demand of articles to cite more sources than necessary. Abstract citations are the stars of the show. Nevertheless, after graphing the citations, Pareto distributions did emerge:

CCC abstract citations

RSQ abstract citations

RR abstract citations

Citations in the CCC abstracts occurred in a slightly more even distribution than citations in CCC articles (cf. Mueller). But then, there aren’t many citations in this corpus relative to the RSQ and RR corpora. Among the citations that do appear, none occurs in many more abstracts than the sources cited only once. The citation occurring most frequently, Burke, occurs in five abstracts. Does this graph confirm Mueller’s conclusion about a dappled CCC? To some extent, yes. There’s still a long tail, after all . . .

RSQ citations even more obviously display the Pareto distribution discussed in Mueller’s article. The citations occurring most frequently—Burke and Plato—surface in 17 and 14 abstracts, respectively.

The distribution in RR is also uneven, and the drop of the long tail is even more precipitous than the one in RSQ. Burke is cited in 14 abstracts and the next most frequent source, Aristotle, is cited in 5 abstracts.
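One rough way to put a number on "more precipitous" is to compare the most cited source with the runner-up in each journal. The sketch below uses only the abstract counts reported above. The ratio is my own quick yardstick, not a standard measure of a power law.

```python
# (most cited, second most cited) abstract counts reported in this post.
top_two = {"RSQ": (17, 14), "RR": (14, 5)}

# How many times more often the top source appears than the runner-up.
drop_ratio = {journal: first / second for journal, (first, second) in top_two.items()}

for journal, ratio in drop_ratio.items():
    print(f"{journal}: top source appears {ratio:.1f}x as often as the runner-up")
```

On this measure, Burke barely edges out Plato in RSQ but towers over Aristotle in RR.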

These graphs indicate that even in article abstracts, where only the most vital sources are invoked, a small canon of core scholars emerges beside an otherwise long, flat, dappled distribution of citations. More divergence and specialization, then, and not just in CCC but in RR and RSQ.

I think there’s more to it than disciplinary divergence, however. These long tails can undoubtedly be explained mathematically—the conclusion: they’re inevitable—but in this particular case they might also be explainable in prosaic terms. And I believe this prosaic explanation makes sense of the long tail in a way that salvages a shred of disciplinary unity within each journal:

In RR and RSQ, for example, an obvious citation pattern emerges. Five of the ten most cited sources in the RSQ abstracts are historical figures: Plato, Aristotle, Quintilian, Blair, and Cicero. In RR, the exact same thing: Aristotle, Cicero, Isocrates, Plato, Quintilian. And further down the long tail in both citation counts, historical figures continue to emerge, mostly from the Greco-Roman world but from beyond it as well. In the CCC long tail, on the other hand, historical figures occur less frequently, and only two predate the 19th century.

Raw numbers for RR and RSQ: 27 (or 22%) of the RR citations are sources from the 17th century or earlier. 26 (or 16%) of RSQ citations are from the same period. Most are Greco-Roman sources, but Confucius, Montaigne, and Averroes are also scattered throughout the long tail. We might conclude, then, that a decently sized community of historians of rhetoric communicate in RSQ and RR (when they’re not communicating in Rhetorica, presumably). Their communication adds to the long tail, but does it signify disciplinary divergence and specialization?

Rather, here is one disciplinary community—historians of rhetoric—mapped out in unity. Its borders extend slightly into CCC but its principal territory lies in RSQ and RR. An obvious outcome, if you’re involved in the field. However, it also helps us make partial sense of that worrying Pareto distribution: not all of the singular citations that constitute the long tail are as disconnected as the graphs lead us to believe. In RSQ and RR, many singular citations could be grouped together: Plutarch, Laertius, Strabo, Aristophanes—these are, at least, not as indicative of a dappled disciplinary identity as, say, St. Paul and Steven Mailloux.
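To make the grouping idea concrete, here is a small sketch using only the six tail names just mentioned. It treats each name as a one-abstract citation, and the community label is a hypothetical hand-assigned tag, not something the data supplied.

```python
from collections import Counter

# Tail citations named in this post; each is treated as appearing in one abstract.
tail = ["Plutarch", "Laertius", "Strabo", "Aristophanes", "St. Paul", "Steven Mailloux"]

# Hypothetical hand-assigned label; names without a label stay on their own.
community = {name: "Greco-Roman sources" for name in tail[:4]}

# Re-count the tail with grouped names collapsed into their community.
grouped = Counter(community.get(name, name) for name in tail)

singletons_before = len(tail)
singletons_after = sum(1 for n in grouped.values() if n == 1)
```

Six disconnected singletons become one cluster of four plus two genuine outliers, which is the sense in which the tail is less dappled than the graph suggests.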

The same point can be made with pedagogy in the CCC abstracts. It is not surprising, of course, that CCC is home to scholars citing pedagogically inclined sources; however, for a second time, this obvious point helps make sense of the Pareto distribution of citations presented here and in Mueller’s article. Charles Pierce, Mina Shaughnessy, Melvin Tolson, Les Perelman: each appears only once, scattered throughout the long tail of abstract citations. But each is invoked for direct relevance to writing pedagogy. Viewed in this way, the flat distribution of citations seems a little less dappled.

Robo-Graders

I was wrong about the mechanization of student writing. I had assumed another year or two would pass before MOOCs began utilizing essay grading software. Turns out it’s happening now. EdX, founded by Harvard and MIT and probably the most prestigious online course program, has announced that it will implement its own assessment software to grade student writing.

Marc Bousquet’s essay successfully mines the reasons why humanities profs are anxious about algorithmic scoring. The reality is, across many disciplines, the writing we ask our students to do is “already mechanized.” The five-paragraph essay, the research paper, the literature review . . . these are all written genres with well-defined parameters and expectations. And if you have parameters and expectations for a text, it’s quite easy to write algorithms to check whether the parameters were followed and the expectations met.
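To show how little it takes, here is a toy sketch of that kind of check. It is not how EdX or any real grading product works. The function name `check_essay`, the thresholds, and the citation pattern are all assumptions I made up for illustration.

```python
import re


def check_essay(text, paragraphs=5, min_words=300, max_words=800):
    """Check a few surface parameters of an essay (hypothetical rubric)."""
    # Paragraphs are assumed to be separated by blank lines.
    paras = [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    words = text.split()
    return {
        "paragraph_count_ok": len(paras) == paragraphs,
        "length_ok": min_words <= len(words) <= max_words,
        # Looks for a parenthetical like (Smith 2010) or (Smith, 2010).
        "cites_a_source": bool(re.search(r"\([A-Z][A-Za-z]+,? \d{4}\)", text)),
    }


# A made-up five-paragraph "essay" of filler words, for illustration only.
essay = "\n\n".join(["word " * 80 + "(Smith 2010)"] * 5)
report = check_essay(essay)
```

Every check here is a surface marker. Whether the source was actually put to use is a different question, but the formal expectations themselves are trivially checkable.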

The only way to ensure that a written product cannot be machine graded is to ensure that it has ill-defined parameters and vague or subjective expectations. For example, the expectations for fiction and poetry are highly subjective—dependent, ultimately, on individual authors and the myriad reasons why people enjoy those authors. It might be possible to machine grade a Stephen King novel on its Stephen-King-ness (based on the expected style and form of a Stephen King novel), but otherwise, it will remain forever impossible to quantitatively ‘score’ novels qua novels or poems qua poems, and there’s no market for doing that anyway. Publishers will never replace their front-line readers and agents with robots who can differentiate good fiction from bad fiction.

However, when we talk about student writing in an academic context, we’re not talking about fiction or poetry. We’re talking about texts that are highly formulaic and designed to follow certain patterns, templates, and standardized rhetorical moves. This description might sound like fingernails on a chalkboard to some, but look, in the academic world, written standards and expectations are necessary to optimize for the clearest possible communication of ideas. The purpose of lower division writing requirements is to enculturate students into the various modes of written communication they are expected to follow as psychologists, historians, literary critics, or whatever.

Each discourse community, each discipline, has its own way of writing, but the differences aren’t anywhere near incommensurable (the major differences exist across the supra-disciplines: hard sciences, soft sciences, social sciences, humanities). No matter the discipline, however, there is a standard way that members of that discipline are expected to write and communicate—in other words, texts in academia will always need to conform to well-defined parameters and expectations. Don’t believe it? One of the most popular handbooks for student writers, They Say/I Say, is a hundred pages of templates. And they work.

So what’s my point? My point is that it’s very possible to machine-grade academic writing in a fair and useful way because academic writing by definition will have surface markers that can be checked with algorithms. Clearly, the one-size-fits-all software programs, like the ones ETS uses, are problematic and too general. Well, all that means is that any day now, a company will start offering essay-grading software tailor-made for your own university’s writing program, or psychology department, or history department, or Writing Across the Curriculum program, or whatever—software designed to score the kind of writing expected in those programs. Never bet against technology and free enterprise.

And that’s another major point—there’s not a market for robot readers at publishing firms, but there certainly is a market for software that can grade student writing. And wherever there’s a need or a want or some other exigence, technology will fill the void. The exigence in academia is that there are more students than ever and less money to pay for full-time faculty to teach these students. Of course, this state of affairs isn’t an exigence for the Ivy League, major state flagships, or other elite institutions—these campuses are not designed for the masses. The undergraduate population at Yale hasn’t changed since 1978. A few years ago, a generous alumnus announced his plans to fund an increase in MIT’s undergraduate body—by a whopping 250 students. Such institutions will continue to be what they are: boutique experiences for the future elite. I imagine that Human-Graded Writing will continue to be a mainstay at these boutique campuses, kind of like Grown Local stickers are a mainstay of Whole Foods.

For the vast majority of undergraduates—those at smaller state colleges, online universities, or those trying to graduate in 4 years by taking courses through EdX—machine-grading will be an inevitable reality. Why? It fulfills both exigencies I mentioned above. It allows colleges to cut costs while simultaneously making it easier to get more students in and out of the door. Instead of employing ten adjuncts or teaching associates to grade papers, you just need a single tenure-track professor who posts lectures and uploads essays with a few clicks.

So, the question for teachers of writing (the question for any professors who value writing in their courses) is not “How can we stop machine-grading from infiltrating the university?” It’s here. It’s available. Rather, the question should be, “How can we best use it?”

Off the top of my head . . .

Grammar, mechanics, and formatting. Unless we’re teaching ESL writing or remedial English, these aspects tend to get downplayed. I know I rarely talk about participial clauses or the accusative case. I overlook errors all the time, focusing instead on higher-order concerns—say, whether or not a secondary source was really put to use or just quoted to fill a requirement. However, I don’t think it’s a good thing that we overlook these errors. We do so because there are only so many minutes in a class or a meeting. With essay-grading software, we can bring sentence-level issues to students’ attention without taking time away from higher-order concerns.

Quicker response times for ESL students, and, perhaps, more detailed responses than a single instructor could provide, especially if she’s teaching half-a-dozen courses. Anyone who has tried to learn a second language knows that waiting a week or two for teacher feedback on your writing is a drag. In my German courses, I always wished I could get quick feedback on a certain turn of phrase or sentence construction, lest something wrong or awkward get imprinted in my developing grammar.

So, I guess my final point is that there are valid uses for essay-grading software, even for those of us teaching at institutions that won’t ever demand its use en masse. Rather than condemn it wholesale, we (and by we, I mean every college, program, professor, and lecturer) should figure out how to adapt to it and use it to our advantage.