Distant Reading and the “Evolution” Metaphor

1

Are there any corpora that purposefully avoid “diachronicity”? There are corpora that possess no meta-data about publication dates and whose texts are therefore organized by some other scheme—for example, the IMDB movie review corpus, which is organized according to positive/negative polarity; its texts, as far as I know, are not arranged chronologically or coded for time in any way. And there are cases where time-related data are not available, easily or at all. But have any corpora been compiled with dates—the time element—purposefully elided? Is time ever left out of a corpus because that information might be considered “noise” to researchers?

Maybe in rare situations. But for most corpora whose texts span any length of time greater than a year, the texts are, if possible, arranged chronologically or somehow tagged with date information. In this universe, time flows in one direction, so assembling hundreds or thousands of texts with meta-data related to their dates of publication means the resulting corpus will possess an inherent diachronicity whether we want it to or not. We can re-arrange the corpus for machine-learning purposes, but the “time stamp” is always there, ready to be explored. Who wouldn’t want to explore it?

If we have a lot of texts—any data, really—that span a great length of time, and if we look at features in those data across the time span, what do we end up studying? In nearly all cases, we end up studying patterns of formal change and transformation across spans of time. The “evolution” metaphor suggests itself immediately. Be honest, now, you were thinking about it the minute you compiled the corpus.

One can, of course, use “evolution” as a general synonym for change. This is probably the case for Thomas Miller’s The Evolution of College English and for many other studies whose data extend only to a limited number of representative sources. However, when it comes to distant readings, the word becomes much more tempting. The trees of Moretti’s Graphs, Maps, Trees are explicitly evolutionary:

For Darwin, ‘divergence of character’ interacts throughout history with ‘natural selection and extinction’: as variations grow apart from each other, selection intervenes, allowing only a few to survive. In a seminar a few years ago, I addressed the analogous problem of literary survival, using as a test case the early stages of British detective fiction . . . (70-71)

The same book ends with an afterword by geneticist Alberto Piazza (who worked with Luigi Luca Cavalli-Sforza on The History and Geography of Human Genes). Piazza writes:

[Moretti's writings] struck me by their ambition to tell the ‘stories’ of literary structures, or the evolution over time and space of cultural traits considered not in their singularity, but their complexity. An evolution, in other words, ‘viewed from afar’, analogous at least in certain respects to that which I have taught and practiced in my study of genetics. (95)

Analogous at least in certain respects . . . For Moretti and Piazza, literary evolution is not just a synonym for change in literature. Biological evolution becomes a guiding metaphor (not perfect, by any means) for the processes of formal change analyzed by Moretti. Piazza continues:

The student of biological evolution is especially interested in the root of a [phylogenetic] tree (the time it originated). . . . The student of literary evolution, on the other hand, is interested not so much in the root of the tree (because it is situated in a known historical epoch) as in its trajectory, or metamorphoses. This is an interest much closer to the study of the evolution of a gene, the particular nature of whose mutations, and the filter operated by natural selection, one wants to understand . . . (112-113)

Obviously, for Piazza, Moretti’s study of changes to and migrations of literary form in time and space evokes the processes and mechanisms of biological evolution—there’s not a one-to-one correspondence, of course, and Piazza points this out at length, but the similarities are evocative enough that he, a population geneticist, felt confident publishing his thoughts on the subject.

In Distant Reading, Moretti has more recently acknowledged that the intense data collection and quantitative analysis that has marked work at Stanford’s Literary Lab must at some point heed “the need for a theoretical framework” (122). Regarding that framework, he writes:

The results of the [quantitative] exploration are finally beginning to settle, and the un-theoretical interlude is ending; in fact, a desire for a general theory of the new literary archive is slowly emerging in the world of digital humanities. It is on this new empirical terrain that the next encounter of evolutionary theory and historical materialism is likely to take place. (122)

In Macroanalysis, Matthew Jockers also acknowledges (and resists) the temptation to initiate an encounter between evolutionary theory and the quantitative, diachronic data compiled in his book:

. . . the presence of recurring themes and recurring habits of style inevitably leads us to ask the more difficult questions about influence and about whether these are links in a systematic chain or just arbitrary, coincidental anomalies in a disorganized and chaotic world of authorial creativity, intertextuality, and bidirectional dialogics . . .

“Evolution” leaps to mind as a possible explanation. Information and ideas do behave in ways that seem evolutionary. Nevertheless, I prefer to avoid the word evolution: books are not organisms; they do not breed. The metaphor for this process breaks down quickly, and so I do better to insert myself into the safer, though perhaps more complex, tradition of literary “influence” . . . (155)

And in the last chapter of Why Literary Periods Mattered, Ted Underwood does not mention evolution at all, but there is clearly an evolutionary connotation to the terms he uses to describe digital humanities’ influence on literary scholars’ conception of history:

. . . digital and quantitative methods are a valuable addition to literary study . . . because their ability to represent gradual, macroscopic change brings a healthy theoretical diversity to literary historicism . . .

. . . we need to let quantitative methods do what they do best: map broad patterns and trace gradients of change. (159, 170)

Underwood also discusses “trac[ing] processes of change” (160) and “causal continuity” (161). The entire thrust of Underwood’s argument, in fact, is that distant or quantitative readings of literature will force scholars to stop reading literary history as a series of discrete periods or sharp cultural “turns” and to view it instead as a process of gradual change in response to extra-literary forces—”Romanticism” didn’t just become “Naturalism” any more than Homo erectus one decade decided to become Homo sapiens.

Tracing processes of gradual, macroscopic change . . . if that doesn’t invoke evolutionary theory, I don’t know what does. Underwood doesn’t even need to use the word.

Moretti, Jockers, and Underwood are three big names in digital humanities who have recognized, either explicitly or implicitly, that distant reading puts us face to face with cultural transformation on a large, diachronic scale. Anyone working with DH methods has likely recognized the same thing. Like I said, be honest: you were already thinking about this before you learned to topic model or use the NLTK.

 

2

Human culture changes—its artifacts, its forms. This is not up for debate. Even if we think human history is a series of variations on a theme, the mutability of cultural form remains undeniable, even more undeniable than the mutability of biological form. Distant reading, done cautiously, gives us a macro-scale, quantitative view of that change, a view simply not possible to achieve at the scale of individual texts or artifacts. Given the fact of cultural transformation, then, and DH’s potential to visualize it, to quantify aspects of it, one of two positions must be taken.

1. The diachronic patterns we discover in our distant readings are, to use Jockers’ words, “just arbitrary, coincidental anomalies in a disorganized and chaotic world of authorial creativity, intertextuality, and bidirectional dialogics.” Theorizing the patterns is a fool’s errand.

2. The diachronic patterns we discover are not arbitrary or random. Theorizing the patterns is a worthwhile activity.

Either we believe that there are processes guiding cultural change (or, at least, that it’s worthwhile to discover whether or not there are such processes) or we assume a priori that no such processes exist. (A third position, I suppose, is to believe that such processes exist but we can never know them because they are too complex.) We can all decide differently. But those who adopt the first position should kindly leave the others to their work. In my view, certain criticisms of distant reading amount to an admonition that “What you’re trying to do just can’t be done.” We’ll see.

 

3

When we decide to theorize data from distant readings, what are we theorizing? Moretti, Jockers, and Underwood each provide a similar answer: we are theorizing changes to a cultural form over time and, in some instances, space. Certain questions present themselves immediately: Are the changes novel and divergent, or are they repeating and reticulating? Is the change continuous and gradual, or are there moments of punctuated equilibrium? How do we determine causation? Are purely internal mechanisms at work, or external dynamics, or a complex interplay of both? How do we reduce the data further, or add layers to them, to untangle the vectors of causation?

To me, all of this sounds purely evolutionary. Even the question of gradual versus rapid change comes straight out of Darwinian theory.

But we needn’t adopt the metaphor explicitly if we are troubled that it breaks down at certain points. Alex Reid writes:

Matthew Jockers remarks following his own digital-humanistic investigation, “Evolution is the word I am drawn to, and it is a word that I must ultimately eschew. Although my little corpus appears to behave in an evolutionary manner, surely it cannot be as flawlessly rule bound and elegant as evolution” (171). As he notes elsewhere, evolution is a limited metaphor for literary production because “books are not organisms; they do not breed.” He turns instead to the more familiar concept of “influence” . . . Certainly there is no reason to expect that books would “breed” in the same way biological organisms do (even though those organisms reproduce via a rich variety of means). [However], if literary production were imagined to be undertaken through a network of compositional and cognitive agents, then such productions would not be limited to the capacity of a human to be influenced. Jockers may be right that “evolution” is not the most felicitous term, primarily because of its connection to biological reproduction, but an evolutionary-type process, a process as “natural” as it is “cultural,” as “nonhuman” as it is “human,” may exist.

An “evolutionary-type” process of culture is what we’re after, one that is not necessarily reliant on human agency alone. Will it end up being as “flawlessly rule bound and elegant” as evolution itself? First, I think Jockers seriously overestimates the “flawless” nature of evolutionary theory and population genetics. If the theory of evolution is so flawless and elegant, and all the science settled, what do biologists and geneticists do all day? Here’s a recent statement from the NSF:

Understanding the tree of life has been a goal of evolutionary biologists since the time of Darwin. During the past decade, unprecedented gains in gathering and analyzing phylogenetic data have demonstrated increasingly complex genealogical patterns.

. . . . Our current knowledge of processes such as hybridization, endosymbiosis and lateral gene transfer makes clear that the evolutionary history of life on Earth cannot accurately be depicted as a single, typological, bifurcating tree.

Moretti, it turns out, needn’t worry so much about the fact that cultural evolution reticulates. And Jockers needn’t assume that biological evolution is elegantly settled stuff.

Secondly, as Reid argues, we needn’t hope to discover a system of influence and cultural change that can be reduced to equations. We probably won’t find any such thing. However, within all the textual data, we can optimistically hope to find regularities, patterns that can be used to make predictions about what might be found elsewhere, patterns that might connect without casuistic contrivance to theories from the sciences. Here’s an example, one I’ve used several times on this blog: Derek Mueller’s distant reading of the journal College Composition and Communication. Mueller used article citations as his object of analysis. When he counted and graphed a quarter century of citations in the journal, he discovered patterns that looked like this:

[Figure: Mueller’s long-tail distribution of CCC citations]

Actually, based on similar studies of academic citation patterns, we could have predicted that Mueller would discover this power law distribution. It turns out that academic citations—a purely cultural form, a textual artifact constructed through the practices of the academy—behave according to a statistical regularity that seems to describe all sorts of things, from earthquakes to word frequencies. This example makes a strong case against those who argue that cultural artifacts, constructed by human agents within their contextualized interactions, will not aggregate over time into scientifically recognizable patterns. Granted, this example comes from mathematics, not evolutionary theory, but it makes the point nicely anyway: the creations of human culture are not necessarily free from non-human processes. Is it foolish to look for the effects of these processes through distant reading?
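For anyone who wants to check the shape of a citation list themselves, here is a minimal sketch in Python. The input file and its format are hypothetical, not Mueller’s actual data: count how often each source is cited, then plot count against rank on log-log axes; a long, roughly linear tail is the signature of a power-law-like distribution.

```python
# A minimal sketch: inspect whether citation counts look power-law-like.
# "citations.txt" (one cited-source name per line) is a hypothetical input,
# not Mueller's actual data. Requires matplotlib.
from collections import Counter

import matplotlib.pyplot as plt

with open("citations.txt", encoding="utf-8") as f:
    cited_sources = [line.strip() for line in f if line.strip()]

counts = Counter(cited_sources)                # times each source is cited
freqs = sorted(counts.values(), reverse=True)  # citation counts, largest first
ranks = range(1, len(freqs) + 1)

plt.loglog(ranks, freqs, marker=".", linestyle="none")
plt.xlabel("Source rank")
plt.ylabel("Citation count")
plt.title("Rank vs. citation count (roughly linear on log-log axes = long tail)")
plt.show()

# How heavy is the tail? Share of sources cited only once or twice:
singletons = sum(1 for v in counts.values() if v <= 2)
print(f"{singletons / len(counts):.0%} of sources are cited once or twice")
```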

 

4

“Evolution,” “influence,” “gradualism”—whatever we call it in the digital humanities, those of us adopting it on the literary and rhetorical end have a huge advantage over those working in history: we have a well-defined, observable element, an analogue of DNA, to which we can always reduce our objects of study: words. If evolution is going to be a guiding metaphor, we need this observable element because it is through observations of its metamorphoses (in usage, frequency, etc.) that we begin to figure out the mechanisms and dynamics that actually cause or influence those metamorphoses. If we had no well-defined segment to observe and quantify, the evolutionary metaphor could be thrown right out.

To show its importance, allow me a rhetorical demonstration. First, I’ll write out Piazza’s description of biological evolution found in his afterword to Graphs, Maps, Trees. Then, I’ll reproduce the passage, substituting lexical and rhetorical terms for “genes” but leaving everything else more or less the same. Let’s see how it turns out:

Recognizing the role biological variability plays in the reconstruction of the memory of our (biological) past requires ways to visualize and elaborate data at our disposal on a geographical basis. To this end, let us consider a gene (a segment of DNA possessed of a specific, ascertainable biological function); and for each gene let us analyze its identifiable variants, or alleles. The percentage of individuals who carry a given allele may vary (very widely) from one geographical locality to another. If we can verify the presence or absence of that allele in a sufficient number of individuals living in a circumscribed and uniform geographical area, we can draw maps whose isolines will join all the points with the same proportion of alleles.

The geographical distribution of such genetic frequencies can yield indications and instruments of measurement of the greatest interest for the study of the evolutionary mechanisms that generate genetic differences between human populations. But their interpretation involves quite complex problems. When two human populations are genetically similar, the resemblance may be the result of a common historical origin, but it can also be due to their settlement in similar physical (for example, climatic) environments. Nor should we forget that styles of life and cultural attitudes of an analogous nature (for example, dietary regimes) can favour the increase or decrease to the point of extinction of certain genes.

Why do genes (and hence their frequencies) vary over time and space? They do so because the DNA sequences of which they are composed can change by accident. Such change, or mutations, occurs very rarely, and when it happens, it persists equally rarely in a given population in the long run . . . From an evolutionary point of view, the mechanism of mutation is very important because it introduces innovations . . .

. . . The evolutionary mechanism capable of changing the genetic structure of a population most swiftly is natural selection, which favours the genetic types best adapted for survival to sexual maturity, or with a higher fertility. Natural selection, whose action is continuous over time, having to eliminate mutations that are injurious in a given habitat, is the mechanism that adapts a population to the environment that surrounds it. (100-101)

Now for the “distant reading” version:

Recognizing the role lexical variability plays in the reconstruction of the memory of our (literary and rhetorical) past requires ways to visualize and elaborate data at our disposal on the basis of cultural space (which often correlates with geography). To this end, let us consider a word (a segment of phonemes and morphemes possessed of a specific, ascertainable grammatical or semantic function); and for each word let us analyze its stylistic variants, or synonyms. The percentage of texts that carry a given stylistic variant may vary from one cultural space to another, or from one genre to the other. If we can verify the presence or absence of that variant in a sufficient number of texts produced in a circumscribed and uniform cultural space we can draw maps whose isolines will join all the points with the same proportion of stylistic variants.

The distribution of such lexical frequencies can yield indications and instruments of measurement of the greatest interest for the study of the evolutionary mechanisms that generate lexical differences between “generic populations.” But their interpretation involves quite complex problems. When two rhetorical forms or genres are lexically similar, the resemblance may be the result of a common historical origin, but it can also be due to their development in similar geographic or political environments. Nor should we forget that styles of life and cultural attitudes of an analogous nature (for example, religious dictates) can favour the increase or decrease to the point of extinction of certain lexical items or clusters of lexical items.

Why do words (and hence their frequencies and “clusterings”) vary over time and space? They do so because of stylistic innovations. Such innovation occurs very rarely, and when it happens, it persists equally rarely in a given generic population in the long run . . . From an evolutionary point of view, the mechanism of innovation is very important because it introduces new rhetorical forms . . .

. . . The evolutionary mechanism capable of changing the lexical structure of a rhetorical form or genre most swiftly is cultural selection, which favours the forms best adapted for survival to publication and circulation, or with a higher degree of influence (meaning a higher likelihood of being reproduced by others without too many changes). Cultural selection, whose action is continuous over time, having to eliminate rhetorical innovations or “mutations” that are injurious in a given cultural habitat, is the mechanism that adapts a rhetorical form to the environment that surrounds it.

Obviously, it’s not perfect. I leave it to the reader to decide its persuasive potential.

I think the biggest problem is in the handling of mutations. In biological evolution, genes mutate via chance variations during replication of their segments; these mutations can introduce innovations in an organism’s form or function. In literary evolution, however, no sharp distinction exists between a lower-scale “mutation” and the innovation it introduces. The innovation is the formal mutation. This issue arises because, in literary evolution, as in linguistic evolution, the genotype/phenotype distinction is not as obvious or strictly scaled as it is in evolutionary theory. Words are more phenotype than genotype, unless we want to get lost in an overly complex evocation of morphology and phonology.

The metaphor always breaks down somewhere, but where it works, it is, I think, highly suggestive: the idea is that we track rhetorical forms—constellations of words and their stylistic variants—across time and space, in order to see where the forms replicate and where they disappear. Attach meta-data to the texts that constitute those forms, and we will have what it takes to begin making data-driven arguments about how cultural ecology affects or does not affect cultural form.

It’s an interesting framework in which distant reading might go forward, even if explicit uses of the word “evolution” are abandoned.

Graphing Citations and Making Sense of Disciplinary Divisions

A Pareto distribution: the troubling result of Derek Mueller’s distant reading of citations in College Composition and Communication. A “long tail” of citations: a handful of names cited many times, but exponentially more names cited only once. Out of 8,035 unique sources cited, 5,761 were cited once and 986 were cited twice. In other words, 84% of the sources cited in CCC occurred only once or twice in a 25-year period.

Troubling, but unsurprising. Physical and social scientists have long known that power law distributions occur across a wide variety of phenomena, including academic citations (Gupta et al. 2005). That a long tail occurs in a rhet/comp journal simply puts our discipline in the same position as everyone else: a small body of scholarly work has gained a “cumulative advantage” or “preferential attachment” and thus become the core set of classic texts recognized by the field, while most other scholars fail to produce texts that cross the tipping point toward their own preferential attachment. It is usually assumed that this core group of scholars is what unites a discipline. To some extent, the assumption is probably true. However, Mueller is right to ask how far a citation trail can lead away from that core group of scholars before we start questioning just how unified a discipline really is.

When graphing citation counts, it’s not problematic to discover a steep drop between the most cited scholar and the tenth most cited scholar; nor is it problematic that most sources are cited infrequently. The problem is not the long tail. The problem, in CCC’s case, is that the long tail very rapidly approaches a value of one. This indicates that any given source in CCC is valuable to the scholar citing it but effectively worthless to everybody else who publishes in the journal. If most sources were cited three, four, five times, even that would suggest a certain unity of purpose—what one scholar has found valuable, several others have found valuable as well, in various issues and various contexts. But when the long tail is mostly composed of sources cited once and never again? That requires a more robust explanation than a nod toward a core group of scholars can provide. Mueller thus raises the right question:

Although we do not at this time have data from all of the major journals to investigate this fully, the changing shape of the graphed distribution reiterates more emphatically a question only hinted at . . . but one nevertheless crucial to the idea of a common disciplinary domain: How flat can the citation distribution become before it is no longer plausible to speak of a discipline?

To answer Mueller’s call for more data, I have compiled article abstracts from CCC and two other major journals in the field—Rhetoric Society Quarterly and Rhetoric Review. I intend this post to serve as a tentative response to the question posed by Mueller at the end of this quote.  The CCC abstracts run from February 2000 (51.3) to September 2011 (63.1), a total of 261 abstracts. The RSQ abstracts run from Winter 2000 (30.1) to Fall 2011 (41.5), a total of 220 abstracts. The RR abstracts run from 2002 (21.3) to 2011 (30.4), a total of 154 abstracts.

Only abstracts, not full articles. However, because only the most important citations appear in abstracts, I think tallying abstract citations offers the best chance to shorten the long tail and partially alleviate the implications of Mueller’s work. It is not a slight to the humanities to point out that articles demand more citations than their arguments actually require: many article citations can be removed without affecting anything vital to an argument. Citations in abstracts, on the other hand, are in most cases central to the argument or study undertaken. If we count only the most important sources in each journal—the ones that surface in abstracts—is the long tail of citation distributions less pronounced? We can expect to discover a long tail. That’s a mathematical inevitability. But if a journal—to say nothing of an entire discipline—is somehow unified, citations in abstracts should have a slightly less extreme power law distribution than citations in the articles themselves. Abstract citations are the “cream of the crop,” those vital enough to make it into the space constraints of the abstract genre: we hope to find fewer citations and therefore a graph that does not drop so precipitously toward a count of one.

Methods: Each corpus was processed with the Natural Language Toolkit and tagged for part of speech. I then compiled the proper nouns. The proper-noun list included more than personal names, so I extracted the names—noun forms (e.g., “Burke” or “Burke’s”) and adjective forms (e.g., “Burkean”)—and tracked them across the abstracts, recording each unique citation along with the number of abstracts in which it was cited.
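For readers who want to try something similar, here is a minimal sketch of that pipeline in the NLTK. The file name is a placeholder, and the normalization is much cruder than the by-hand grouping described above (separating personal names from other proper nouns, and folding in adjectival forms like “Burkean,” remained manual steps):

```python
# A minimal sketch of the abstract-citation count, assuming the abstracts
# are stored one per line in a plain-text file (hypothetical file name).
from collections import Counter

import nltk

# NLTK data needed (download once):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def proper_nouns(text):
    """Return tokens tagged as proper nouns (NNP/NNPS) in one abstract."""
    tokens = nltk.word_tokenize(text)
    return [word for word, tag in nltk.pos_tag(tokens) if tag in ("NNP", "NNPS")]

def normalize(name):
    """Strip a possessive ending if the tokenizer leaves one attached.
    (Adjectival forms like 'Burkean' were grouped by hand in the analysis above.)"""
    for suffix in ("'s", "’s"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name

with open("ccc_abstracts.txt", encoding="utf-8") as f:
    abstracts = [line.strip() for line in f if line.strip()]

# Count the number of abstracts in which each name appears (not raw tokens).
# Note: the NNP list still includes journal titles, places, etc.; picking out
# the personal names is a further manual step.
citation_counts = Counter()
for abstract in abstracts:
    citation_counts.update({normalize(n) for n in proper_nouns(abstract)})

for name, count in citation_counts.most_common(10):
    print(name, count)
```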

[Figure: Finding citation names]

Here are spreadsheets with the unique citations and their citation counts in each abstracts corpus: College Composition and Communication. Rhetoric Society Quarterly. Rhetoric Review.

There are 79 unique citations in the CCC abstracts; 159 unique citations in the RSQ abstracts; and 121 unique citations in the RR abstracts. Only six citations occur in both the RSQ and CCC abstracts corpora: Mina Shaughnessy, Kenneth Burke, John Dewey, Donald Davidson, Peter Elbow, and Mikhail Bakhtin. When factoring in RR, only Kenneth Burke, John Dewey, and Peter Elbow are shared across all three corpora. RR and RSQ share quite a few sources, almost all of which are historical figures—Plato, Aristotle, Cicero, Isocrates, and the like. Kenneth Burke is the most frequently cited source in each abstracts corpus: he is cited in 5 separate abstracts in CCC, 17 in RSQ, and 14 in RR. Maybe “rhetoric and composition” should be changed to “Burkean studies.” No surprise—the man has his own journal.

Based on the raw count of unique citations in each journal—on average, fewer than one per abstract—I think my original suggestion is at least partially correct: counting citations in abstracts controls for the rhetorical demand of articles to cite more sources than necessary. Abstract citations are the stars of the show. Nevertheless, when I graphed the citations, Pareto distributions did emerge:

[Figure: CCC abstract citations]

[Figure: RSQ abstract citations]

[Figure: RR abstract citations]

Citations in the CCC abstracts occurred in a slightly more even distribution than citations in CCC articles (cf. Mueller). But then, there aren’t many citations in this corpus, relative to the RSQ and RR corpora. Among the citations that do appear, even the most frequent do not occur in many more abstracts than the sources cited only once. The citation occurring most frequently—Burke—occurs in five abstracts. Does this graph confirm Mueller’s conclusion about a dappled CCC? To some extent, yes. There’s still a long tail, after all . . .

RSQ citations even more obviously display the Pareto distribution discussed in Mueller’s article. The citations occurring most frequently—Burke and Plato—surface in 17 and 14 abstracts, respectively.

The distribution in RR is also uneven, and the drop of the long tail is even more precipitous than the one in RSQ. Burke is cited in 14 abstracts and the next most frequent source, Aristotle, is cited in 5 abstracts.

These graphs indicate that even in article abstracts—where only the most vital sources are invoked—a small canon of core scholars emerges beside an otherwise long, flat, dappled distribution of citations. More divergence and specialization, then—not just in CCC but in RR and RSQ.

I think there’s more to it than disciplinary divergence, however. These long tails can undoubtedly be explained mathematically—the conclusion being that they’re inevitable—but in this particular case they might also be explainable in prosaic terms. And I believe this prosaic explanation makes sense of the long tail in a way that salvages a shred of disciplinary unity within each journal:

In RR and RSQ, for example, an obvious citation pattern emerges. Five of the ten most cited sources in the RSQ abstracts are historical figures: Plato, Aristotle, Quintilian, Blair, and Cicero. In RR, the exact same thing: Aristotle, Cicero, Isocrates, Plato, Quintilian. Glancing through the long tail of both citation counts, one finds that historical figures continue to emerge, mostly from the Greco-Roman world but from beyond it as well. In the CCC long tail, on the other hand, historical figures occur less frequently, and only two of them predate the 19th century.

Raw numbers for RR and RSQ: 27 (or 22%) of the RR citations are sources from the 17th century or earlier. 26 (or 16%) of RSQ citations are from the same period. Most are Greco-Roman sources, but Confucius, Montaigne, and Averroes are also scattered throughout the long tail. We might conclude, then, that a decently sized community of historians of rhetoric communicate in RSQ and RR (when they’re not communicating in Rhetorica, presumably). Their communication adds to the long tail, but does it signify disciplinary divergence and specialization?

Rather, here is one disciplinary community—historians of rhetoric—mapped out in unity. Its borders extend slightly into CCC but its principal territory lies in RSQ and RR. An obvious outcome, if you’re involved in the field. However, it also helps us make partial sense of that worrying Pareto distribution: not all of the singular citations that constitute the long tail are as disconnected as the graphs lead us to believe. In RSQ and RR, many singular citations could be grouped together: Plutarch, Laertius, Strabo, Aristophanes—these are, at least, not as indicative of a dappled disciplinary identity as, say, St. Paul and Steven Mailloux.

The same point can be made with pedagogy in the CCC abstracts. It is not surprising, of course, that CCC is home to scholars citing pedagogically-inclined sources; however, for a second time, this obvious point helps make sense of the Pareto distribution of citations presented here and in Mueller’s article: Charles Pierce, Mina Shaughnessy, Melvin Tolson, Les Perelman—each appears only once, scattered throughout the long tail of abstract citations. But each is invoked for direct relevance to writing pedagogy. Viewed in this way, the flat distribution of citations seems a little less dappled.

Robo-Graders

I was wrong about the mechanization of student writing. I had assumed another year or two would pass before MOOCs began utilizing essay grading software. Turns out it’s happening now. EdX, founded by Harvard and MIT and probably the most prestigious online course program, has announced that it will implement its own assessment software to grade student writing.

Marc Bousquet’s essay successfully mines the reasons why humanities profs are anxious about algorithmic scoring. The reality is, across many disciplines, the writing we ask our students to do is “already mechanized.” The five-paragraph essay, the research paper, the literature review . . . these are all written genres with well-defined parameters and expectations. And if you have parameters and expectations for a text, it’s quite easy to write algorithms to check whether the parameters were followed and the expectations met.
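To make that concrete, here is a toy sketch of what checking a few “parameters and expectations” algorithmically might look like. The rules, thresholds, and file name are invented for illustration; no vendor’s actual scoring model looks like this:

```python
# A toy sketch of surface-level checks, not a real grading engine:
# the rules and thresholds below are invented for illustration.
import re

TRANSITIONS = {"however", "therefore", "moreover", "furthermore", "consequently"}

def check_essay(text, min_words=500, min_paragraphs=5):
    """Check a handful of invented surface-level expectations."""
    words = re.findall(r"[A-Za-z']+", text)
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return {
        "word_count_ok": len(words) >= min_words,
        "paragraph_count_ok": len(paragraphs) >= min_paragraphs,
        # e.g. "(Smith 2004)" or "(Smith, 2004)"
        "has_parenthetical_citation": bool(re.search(r"\([A-Z][a-z]+,? \d{4}\)", text)),
        "uses_transitions": any(w.lower() in TRANSITIONS for w in words),
    }

if __name__ == "__main__":
    sample = "However, scholars disagree (Smith 2004).\n\nTherefore, more study is needed."
    print(check_essay(sample, min_words=5, min_paragraphs=2))
```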

The only way to ensure that a written product cannot be machine graded is to ensure that it has ill-defined parameters and vague or subjective expectations. For example, the expectations for fiction and poetry are highly subjective—dependent, ultimately, on individual authors and the myriad reasons why people enjoy those authors. It might be possible to machine grade a Stephen King novel on its Stephen-King-ness (based on the expected style and form of a Stephen King novel), but otherwise, it will remain forever impossible to quantitatively ‘score’ novels qua novels or poems qua poems, and there’s no market for doing that anyway. Publishers will never replace their front-line readers and agents with robots who can differentiate good fiction from bad fiction.

However, when we talk about student writing in an academic context, we’re not talking about fiction or poetry. We’re talking about texts that are highly formulaic and designed to follow certain patterns, templates, and standardized rhetorical moves. This description might sound like fingernails on a chalkboard to some, but look, in the academic world, written standards and expectations are necessary to optimize for the clearest possible communication of ideas. The purpose of lower division writing requirements is to enculturate students into the various modes of written communication they are expected to follow as psychologists, historians, literary critics, or whatever.

Each discourse community, each discipline, has its own way of writing, but the differences aren’t anywhere near incommensurable (the major differences exist across the supra-disciplines: hard sciences, soft sciences, social sciences, humanities). No matter the discipline, however, there is a standard way that members of that discipline are expected to write and communicate—in other words, texts in academia will always need to conform to well-defined parameters and expectations. Don’t believe it? One of the most popular handbooks for student writers, They Say/I Say, is a hundred pages of templates. And they work.

So what’s my point? My point is that it’s very possible to machine-grade academic writing in a fair and useful way because academic writing by definition will have surface markers that can be checked with algorithms. Clearly, the one-size-fits-all software programs, like the ones ETS uses, are problematic and too general. Well, all that means is that any day now, a company will start offering essay-grading software tailor-made for your own university’s writing program, or psychology department, or history department, or Writing Across the Curriculum program, or whatever—software designed to score the kind of writing expected in those programs. Never bet against technology and free enterprise.

And that’s another major point—there’s not a market for robot readers at publishing firms, but there certainly is a market for software that can grade student writing. And wherever there’s a need or a want or some other exigence, technology will fill the void. The exigence in academia is that there are more students than ever and less money to pay for full-time faculty to teach these students. Of course, this state of affairs isn’t an exigence for the Ivy League, major state flagships, or other elite institutions—these campuses are not designed for the masses. The undergraduate population at Yale hasn’t changed since 1978. A few years ago, a generous alumnus announced his plans to fund an increase in MIT’s undergraduate body—by a whopping 250 students. Such institutions will continue to be what they are: boutique experiences for the future elite. I imagine that Human-Graded Writing will continue to be a mainstay at these boutique campuses, kind of like Grown Local stickers are a mainstay of Whole Foods.

For the vast majority of undergraduates—those at smaller state colleges, online universities, or those trying to graduate in 4 years by taking courses through EdX—machine-grading will be an inevitable reality. Why? It fulfills both exigencies I mentioned above. It allows colleges to cut costs while simultaneously making it easier to get more students in and out of the door. Instead of employing ten adjuncts or teaching associates to grade papers, you just need a single tenure-track professor who posts lectures and uploads essays with a few clicks.

So, the question for teachers of writing (the question for any professors who value writing in their courses) is not “How can we stop machine-grading from infiltrating the university?” It’s here. It’s available. Rather, the question should be, “How can we best use it?”

Off the top of my head . . .

Grammar, mechanics, and formatting. Unless we’re teaching ESL writing or remedial English, these aspects tend to get downplayed. I know I rarely talk about participial clauses or the accusative case. I overlook errors all the time, focusing instead on higher-order concerns—say, whether or not a secondary source was really put to use or just quoted to fill a requirement. However, I don’t think it’s a good thing that we overlook these errors. We do so because there are only so many minutes in a class or a meeting. With essay-grading software, we can bring sentence-level issues to students’ attention without taking time away from higher-order concerns.

Quicker response times for ESL students, and, perhaps, more detailed responses than a single instructor could provide, especially if she’s teaching half-a-dozen courses. Anyone who has tried to learn a second language knows that waiting a week or two for teacher feedback on your writing is a drag. In my German courses, I always wished I could get quick feedback on a certain turn of phrase or sentence construction, lest something wrong or awkward get imprinted in my developing grammar.

So, I guess my final point is that there are valid uses for essay-grading software, even for those of us teaching at institutions that won’t ever demand its use en masse. Rather than condemn it wholesale, we–and by we, I mean every college, program, professor, and lecturer–should figure out how to adapt to it and use it to our advantage.

Technology and the empirical study of writing

A materialist theory of literary form will ultimately have to concern itself with the organic processes of reading and composition, but the way to do this is through empirical study of readers and writers, not more interpretation of texts, or armchair ruminations.  –Cosma Shalizi in a response to Franco Moretti’s Graphs, Maps, Trees (128)


Janet Emig initiated the writing process movement when she published The Composing Processes of Twelfth Graders, an attempt to study writing in an empirical way (lower case e; no Lockean baggage implied) by closely observing and polling several high school seniors as they wrote essays. Today, the shortcomings of her study are obvious—the sample size was small, and she had no way to track granular textual changes as they were made in real time. However, despite its limitations, Emig’s work introduced an important assumption to the field of writing studies:

Writing is a natural, organic phenomenon that can be studied empirically.

Unfortunately for Emig and the process movement, writing studies was and is situated in an academic context that requires a sharp pedagogical focus, and the empirical study of writing has little to no educational value. Studying writers can tell us how people write, but it doesn’t necessarily tell us how to teach writing, academic or creative or otherwise.

Intimately tied to the pedagogical critique of the process movement was the political critique. In early studies of writing processes, certain contextual elements (read: race, gender, class) were ignored. Emig, for example, did not deeply address the racial differences of her subjects. Critics claimed that the study of writing processes would not pay enough attention to relevant cultural factors that affect how, where, and why people write. This critique was weak, however, because all empirical pursuits must by design bracket out certain contextual elements in their early stages. As the pursuit progresses and gathers knowledge, the causes and effects (if any) of various contextual factors can be coded and controlled for. Race, class, and gender are such factors—important ones at that—but we needn’t stop there (cross-linguistic differences would be first and foremost on my mind). Dozens of material factors must be taken into consideration when studying writing. Had the process movement not been abandoned, researchers would have gotten around to controlling for all of them.

Then there was the philosophical critique. The goal of studying writing is to build evidence-based theories about this unique human practice from a variety of angles—stylistic, material, cognitive, neuronal, linguistic—and eventually to see how these levels interact. (E.g., What areas of the brain are operational at various stages of writing and re-writing? Are small stylistic changes or large organizational changes more influential in shaping a text? How does textual cohesion emerge? What roles do vision and memory play in the way writers work with their texts on word processors? Do writers across languages and writing systems have completely different processes?) However, like many humanities disciplines, writing studies has been influenced by postmodernism and is thus averse to data-driven, quantitative, empirical methods, and not interested in questions—like the ones above—that require these methods. Gary Olson typifies the philosophical critique when he writes that the process movement attempts to “systematize something [writing] that simply is not susceptible to systematization.”

Of course, no evidence is provided for this claim—but then, none is needed. It isn’t a claim at all. It is an a priori assumption, a statement of faith, designed to obviate any empirical work on the subject.

Ironically, the most valid critique of the process movement in writing studies was one that no one ever made: the technological critique. In the 1980s, when the process movement was jostling for academic legitimacy, researchers simply did not have access to the technology that could enable a more robust inquiry into the material, organic processes of writing.

Today, we do have access to that technology. What’s more, the postmodern zeitgeist has waned in its influence, and the political critique was never quite valid. The time seems right for a return, not necessarily to process theory, but to the assumptions it made about data-driven, quantitative, empirical studies of the way humans compose.

Scholars like Chris Anson, Richard Haswell, and Chuck Bazerman are leading the way. Haswell’s call for “RAD” research—replicable, aggregable, data-supported—is essentially a call for empiricism in writing studies. And Anson’s recent report on the use of eye-tracking devices to study writers at work demonstrates how new technologies will enable and benefit this empirical endeavor. Anson’s line of research could lead to major insights into the ways writers access their ‘textual memory’ in order to manage the many semantic strands that comprise any written text. Indeed, this is a perfect example of how technology and RAD methods can test old ideas in writing studies, confirm or complicate those ideas, and fill them in with data-driven details. In 1999, for example, Christina Haas wrote about the way writers manage their texts in Writing Technology: Studies in the Materiality of Literacy:

Clearly, writers interact constantly, closely, and in complex ways with their own texts. Through these interactions, they develop some understanding—some representation—of the text they have created or are creating . . . As the text gets longer and more ideas are introduced and developed, it becomes more difficult to hold an adequate representation in memory of that text, which is out of sight. (117, 121, qtd in Brooke’s Lingua Fracta)

In 2012, enter the eye-tracking software, which can show us where writers look, physically, to develop representations of their texts as they are constructed: What kinds of words or phrases do writers reference most often, as ‘anchors’ for their intellectual wanderings? What are the outside limits of textual vision? Where do writers focus their vision at different stages in the writing process—choosing words, writing sentences, organizing paragraphs, et cetera? Do accomplished writers use their eyes differently than novice writers? Do high-IQ individuals use their eyes less or more while writing; are their visual memories more robust, requiring less visual tracking to make sense of their texts’ cohesion?

Without eye-tracking devices and empirical methodologies, researchers could never hope to answer these and other questions. They would never even think to formulate them.

Outside the field of writing studies, researchers are already using technology to capture and study authorial processes. An IBM study used an application called history flow to study contributions in Wikipedia articles and how numerous contributions endure, change, or are phased out entirely over time. Ben Fry built an amazing visualization of the multiple changes Darwin made to On the Origin of Species across six editions. And on a lesser scale, Timothy Weninger created a time-lapse video that shows the writing of a research paper in various stages (I’m in the process of figuring out how to do something similar, using track changes in Word).

The interest in organic authorial processes extends beyond writing studies, so it boggles my mind that writing studies scholars aren’t at the forefront of this research, which, I grant, is in its early stages. Luckily, things are looking up for RAD, empirical research in writing studies. Now that we can start grounding our theories of composition in real data, it’s only a matter of time before we start gaining empirical insight into this strange, relatively recent human behavior that we call ‘writing’.

Text Network and Corpus Analysis of the Unabomber Manifesto

Introduction

The Unabomber Manifesto—Industrial Society and Its Future—was sent to major newspapers in 1995, with an accompanying promise from its author, Ted Kaczynski, to stop exploding things if someone printed the 35,000-word text in full. The New York Times and the Washington Post obliged in September of that year. The manifesto became a major clue in the hunt for the Unabomber, but only a few forensic linguists concluded that Kaczynski, a suspect at the time, had written it. The majority failed to see a connection between the manifesto and other writings by Kaczynski (these are the same people, I can only guess, who remain skeptical about who wrote Romeo and Juliet). In the end, none of it mattered anyway. Evidence found in Kaczynski’s cabin was far more damning than forensic linguistic analyses of the manifesto.

The Manifesto

You expect the manifesto of a domestic terrorist to be insane. Kaczynski is not your average domestic terrorist. A former Berkeley professor of mathematics with a Michigan PhD, Kaczynski could have feasibly published the essay with a legitimate press or magazine and gained a wide academic audience had he not retreated into the woods and his own head. The manifesto is a real argument that, minus its calls for violence, could have been inserted into a legitimate discourse, albeit one that would have resulted in criticism coming Ted’s way.

Ostensibly, the manifesto is a strong critique of contemporary techno-capitalist society. However, if you took a knife to the text, divided it into little passages, you would discover that half of them bend far leftward and could be read aloud without protest in Harvard Yard, while the other half bend far rightward and could only be read aloud without protest at Hillsdale College.

So, there are passages such as this one, which would send heads nodding in every humanities department in America:

The Industrial Revolution and its consequences have been a disaster for the human race. They have greatly increased the life-expectancy of those of us who live in “advanced” countries, but they have destabilized society, have made life unfulfilling, have subjected human beings to indignities, have led to widespread psychological suffering (in the Third World to physical suffering as well) and have inflicted severe damage on the natural world.

Then comes this curveball:

One of the most widespread manifestations of the craziness of our world is leftism, so a discussion of the psychology of leftism can serve as an introduction to the discussion of the problems of modern society in general.

Like many on the left, Kaczynski blames technology and The System for the sad state of the earth and its inhabitants, yet he suggests that the contemporary left (the “oversocialized” left, as Ted puts it) is in fact The System’s most malformed, though logical, outgrowth.

At first, I couldn’t recognize the motive behind the manifesto. Its politics seemed too conflicted. Then I noticed a brief mention in Kaczynski’s Wikipedia article that ties him to the anarcho-primitivist tradition, and suddenly the text became more philosophically cohesive.

The Manifesto’s Motive

There are two types of anarcho-primitivists: the Rousseau types and the Hobbes types (my own ad hoc terms). The former are human-centric and collectivist. They believe that dismantling techno-capitalist society will usher in an era of equality and harmony between men and women of all races. The latter are earth-centric and individualistic. They believe that dismantling techno-capitalist society will put a halt to overpopulation and environmental degradation, and allow individuals to live more spiritually and physically fulfilled lives.

The goals aren’t mutually exclusive, but neither are they necessarily aligned. (When it comes to immigration, they are outright opposed.) The Hobbesian primitivists tend to believe that nature, for all its beauty and desirability, isn’t a progressive utopia. Who are these Hobbesians? They are the Monkey Wrench Gang radicals, the Edward Abbeys and Doug Peacocks of the environmental movement, the Garrett Hardins of ecology, the survivalists, the Timothy Treadwells, the (typically) men who love nature more than humanity but harbor no romanticism about either. Kaczynski would have gotten along well in the Monkey Wrench Gang, who held no love for humans or community or society in the aggregate because, to them, human communities are precisely the problem.

Let’s put these categorizations aside for now and look to the text of the manifesto itself. A text network analysis and an analysis with the Natural Language Toolkit (NLTK) can provide us with grounded data about Kaczynski’s motives as they appear in his manifesto. The motives of all authors—or at least their traces—are always left behind in the lexical choices of their texts. Deliberate, written language is like a rhetorical fingerprint.

Text Network Analysis

As I’ve discussed in other posts, a text network analysis proceeds in the following way: a text is copied into a .txt file; it is imported into some analytic tool (I use AutoMap) in order to remove stop words and to lightly stem the text; then, using the same tool, the text—which has now been expunged of all but significant content words—is run through an algorithm that treats the content words like a network and creates a co-reference list in .csv format. What words are connected to what other words, and how often? (In this analysis, I used a two-word gap and a five-word gap.) The .csv file is then opened in a network analysis tool (I use Gephi) in order to visualize these semantic connections. Each word is visualized as a node in the network, and words that appear near each other—again, within a certain word gap—are joined by edges.
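For readers without AutoMap, here is a minimal Python stand-in for the co-occurrence step. The file names are placeholders, stemming is omitted, and the two-word gap mirrors the narrower of the two settings mentioned above; the output is a weighted edge list in the .csv format Gephi expects:

```python
# A minimal stand-in for the AutoMap step: build a co-occurrence edge list
# from a text, using a 2-word gap. File names are placeholders; no stemming.
import csv
import re
from collections import Counter

from nltk.corpus import stopwords  # requires: nltk.download("stopwords")

STOP = set(stopwords.words("english"))
GAP = 2  # words within this distance of each other count as co-occurring

with open("manifesto.txt", encoding="utf-8") as f:
    tokens = [w for w in re.findall(r"[a-z']+", f.read().lower()) if w not in STOP]

edges = Counter()
for i, word in enumerate(tokens):
    for other in tokens[i + 1 : i + 1 + GAP]:
        if word != other:
            edges[tuple(sorted((word, other)))] += 1

with open("edges.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["Source", "Target", "Weight"])  # headers Gephi recognizes
    for (a, b), weight in edges.items():
        writer.writerow([a, b, weight])
```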

The two most important network visualizations, in my opinion, show nodes with the highest levels of Betweenness Centrality and the highest levels of Degree Centrality. The latter measures how many total connections a node has to other individual nodes; so, a node with high degree centrality will simply be connected to many other nodes. The former measures how often a node lies on the paths that connect other nodes to one another; so, a node with high betweenness centrality will in essence be an important ‘passageway’ between communities within the network. (Here’s an excellent visual description of the concepts.)

In a textual network, a word with high degree centrality is a word used in connection with myriad other words. This simply tells you that a word is used frequently in a text and in a variety of contexts. A word with high betweenness centrality is a word used frequently and in conjunction with other words that themselves connect to further nodes and form community clusters. This tells you that a word is used frequently, in many contexts, and in connection with words that also do a lot of semantic work in the text. A word with high betweenness centrality is a word through which many meanings in a text circulate.

For example, as you see below, psychological has a high degree centrality in the Unabomber Manifesto but not a high betweenness centrality. This lexical item was therefore used frequently and connected to many different words, such as:

psychological techniques

psychological methods

However, the words to which psychological is connected (techniques and methods) do not themselves perform a lot of semantic work elsewhere in the text. Words like psychological are essentially productive creators of bigrams but not pathways of meaning.

Society, on the other hand, not only has a high degree centrality but also a high betweenness centrality. So, the words that it connects to also have further connections and thus do perform semantic work elsewhere in the text.
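Gephi computes both measures through its interface, but the same numbers can be checked in code. Here is a sketch using networkx and the hypothetical edges.csv produced in the step above:

```python
# A sketch of the two centrality measures discussed above, computed with
# networkx from the hypothetical edges.csv written out earlier.
import csv

import networkx as nx

G = nx.Graph()
with open("edges.csv", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        G.add_edge(row["Source"], row["Target"], weight=int(row["Weight"]))

degree = nx.degree_centrality(G)            # connected to many other words
betweenness = nx.betweenness_centrality(G)  # sits on many paths between other words

def top(scores, n=10):
    """Return the n highest-scoring nodes."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

print("Highest degree centrality:     ", top(degree))
print("Highest betweenness centrality:", top(betweenness))
```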

Here are the text network visualizations:

[Figure: Nodes with the highest Degree Centrality in the manifesto]

[Figure: Nodes with the highest Betweenness Centrality in the manifesto]

The text is long, so its network is messy. In the 5-word gap network, the manifesto had over 200 separate meaning clusters. In the 2-word gap (seen above), it still had over 150 clusters.

Social, society, people, and human are the words with the highest levels of degree centrality in Kaczynski’s manifesto. Also visible in this network are technology and its derivations, psychological, system, freedom, physical, power, leftist, and modern.

Social, society, and people are the words with the highest levels of betweenness centrality in Kaczynski’s manifesto. Also visible in the network are human, problems, system, change, and natural.

As I mentioned earlier, most commentary on the Unabomber manifesto focuses on a) its attack on technology, and b) its attack on leftism. However, as these text networks demonstrate, the words that do the most semantic work in the text—the words through which most meanings flow—suggest that Kaczynski’s sights were set on society as a whole—its people, its systems. Three other words with relatively many connections—psychological, power, and freedom—further suggest that the ostensible screed against leftism and technology masks a deeper motive that circulates in a diffuse, though nonetheless salient way throughout the text. And in the light of Kaczynski’s possible connection to an anarcho-primitivist tradition, these particularly noticeable nodes make much more sense than they would if we tried to paint him as a madman or, worse, a bitter, conservative academic. If he were only that, we might expect other terms to be more noticeable in the network (e.g., the various derivations of leftism).

One thing a text network does, beyond providing an interesting visualization, is to point the researcher in the direction of terms and n-grams that might be explored more granularly in a corpus analysis tool, such as the NLTK. It provides a map of a text’s semantic circulation, a map that can be followed when we return to the world of pure textuality.

Corpus Analysis

Here is a raw count of the most frequent words in the manifesto:

[Figure: unabomberLineChart, the most frequent words in the manifesto]

Certain words weren’t visually important nodes in the text network but were nonetheless used frequently (e.g., goal/s, individual/s, process, industrial, way, work, man, behavior, control); these words were deployed often but in conjunction with a limited number of other terms. Nevertheless, the 20 most frequent words signify a dual emphasis that makes sense if Kaczynski is a certain kind of primitivist: there is the left-wing emphasis on the ills of society, the system, technology, and control; but there is also the right-wing emphasis on individuals and freedom.
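For anyone who wants to reproduce a count like this, the NLTK makes it a few lines of Python (the file name is hypothetical, and the stopword filtering is my own choice, so the exact ranking may differ slightly from the chart above):

```python
# Minimal frequency count with NLTK; requires nltk.download("punkt") and
# nltk.download("stopwords") the first time. The file name is hypothetical.
import nltk

raw = open("unabomber_manifesto.txt").read()
tokens = [t.lower() for t in nltk.word_tokenize(raw) if t.isalpha()]
stop = set(nltk.corpus.stopwords.words("english"))
fdist = nltk.FreqDist(t for t in tokens if t not in stop)
for word, count in fdist.most_common(20):
    print(f"{word:15s}{count:>5d}")
```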

The NLTK can also generate a dispersion plot, which shows where in a text individual words fall. Here is a dispersion plot of the 10 most frequent words:

[Figure: unabomberDispersionPlotTop10, dispersion plot of the 10 most frequent words]

A striking pattern emerges. Although much has been made of the manifesto’s condemnation of the left, the dispersion plot demonstrates that anti-leftism is not a continuous theme in the text but rather forms the bookends: the manifesto opens and closes with references to leftists, but the bulk of the text does not mention them at all. The focus is elsewhere.

The dispersion of technology and technological provides another striking pattern. More than a third of the text passes before Kaczynski begins to deploy these words in earnest, even though a surface reading of the text leaves the reader with the impression that technological anxieties anchor every aspect of the manifesto.

But compare the dispersion of these supposedly central terms—leftist/s, technology, and technological—with the dispersion of other terms in the list. Society, system, people, power, human, and, to a lesser extent, modern all have much more uniform dispersions throughout the manifesto. In other words, these concepts appear more regularly and consistently across the manifesto’s 232 numbered paragraphs, and that is precisely what we should expect if Kaczynski is indeed a primitivist who loves nature more than humanity. His ire is most obviously directed at leftists, but more subtly, the motivated energy of his manifesto is aimed in every direction at society as a whole, in all its malformed, destructive development.
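The dispersion plot itself is nearly a one-liner once the text is wrapped in an nltk.Text object (again, the file name is hypothetical, and I’m passing the terms by hand):

```python
# Lexical dispersion plot with NLTK (matplotlib must be installed).
import nltk

tokens = [t.lower() for t in nltk.word_tokenize(open("unabomber_manifesto.txt").read())]
text = nltk.Text(tokens)
text.dispersion_plot(["leftists", "technology", "technological", "society",
                      "system", "people", "power", "human", "modern", "freedom"])
```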

Ranking Native American language health

I recently finished reading Ellen Cushman’s The Cherokee Syllabary, an excellent book on the history and spread of the writing system developed by Sequoyah for the Cherokee tribe. Cushman does a thorough job explaining how the syllabary works as a syllabary, rather than describing it in alphabetic terms. She argues that to explain a syllabary in terms of one-to-one sound-grapheme correspondence (which is often the tack in linguistic work) is already to analyze it in alphabetic terms.

One of Cushman’s central projects in the book is to demonstrate how the Cherokee syllabary—both its structure and graphic representation—grew from Cherokee culture. It was not, she argues, a simple borrowing and re-application of the Roman alphabetic script. Most scholars would disagree with her, including Henry Rogers in Writing Systems: A Linguistic Approach and Steven Roger Fischer in A History of Writing. Fischer claims that “using an English spelling book, [Sequoyah] arbitrarily appointed letters of the alphabet” to correspond with units of sound in Cherokee (287). Cushman counters this claim by pointing out that linguists only make it after looking at the printed form of Cherokee, which, by necessity, remediated Sequoyah’s original syllabary into a more Latinate form. Cushman provides us with pictures of the original syllabary, as well as a new Unicode font that she believes more adequately represents the original style:

Much of Cushman’s book is devoted to showing the connection between Cherokee culture and the syllabary, a connection which obviates the need to assume some sort of alphabetic borrowing.

I’m not at all convinced by this main argument (still lots of Latinate forms up there), but I was quite interested, after reading the book, in another point Cushman makes about what it means to be Native American, both historically and contemporarily. She posits “four pillars of Native peoplehood: language, history, religion, and place” (6). I would argue that language is the most powerful of the four, but Cushman merely claims that the loss of the Cherokee language would “spell the ruin of an integral part of Cherokee identity.”

No doubt it would. And this got me thinking about native language health in general. As regards Cherokee specifically, Cushman writes that “while the Cherokees are one of the largest tribes in the United States, the Cherokee Nation estimates that only a few thousand speak, read, and write the Cherokee language” (6). I checked this statistic and found it to be correct but misleading. Perhaps only a few thousand Cherokees “speak, read, and write” Cherokee, but 16,000 speak the language.

So what about other native languages? Using Ethnologue and the World Atlas of Language Structures, I ranked all native American languages (and a few Canadian languages) by their ‘linguistic health’, measured purely as number of speakers. Here’s a bar chart of native languages with more than 100 speakers. Already, you can notice the seriously skewed curve that I’ll discuss in a moment . . .

Now, no native language in America (or Canada) is ‘healthy’ compared to English, Spanish, Mandarin, Hindi, or the world’s other dominant languages. Nearly all native American languages are endangered, severely endangered, or extinct. Only one—Navajo—escapes the ‘endangered’ list, but even then, Navajo is lately considered ‘vulnerable’ because the youngest generation is switching to English.

Within this continuum of endangered native languages, however, there exists a highly skewed continuum of linguistic health. There are approx. 115 living languages in America, but only 35 possess more than 1,000 speakers. Only 9 possess more than 10,000 speakers. And only 3 possess more than 50,000 speakers. In other words, the great bulk of living native American languages are in bad shape, and will likely go extinct within the next generation, joining the 41 native languages that already have gone extinct. Here’s the ranking of native languages with fewer than 100 speakers:

And yet what interests me about this data is not the obvious point about language loss in our post-colonial present. Language loss is the inevitable outcome in the wake of conquest; Old English itself was lost when the Norman French invaded Britain. Rather, what interests me is that, extinction and severe endangerment being the rule, several languages have managed to become glaring exceptions to the rule. Why?

According to my list, there are approximately 454,515 native language speakers in America—and parts of Canada, since I’ve included Cree and Ojibwe, Canada’s healthiest native languages, in my list (see the end of this post for more methodological details). At the start of the colonial era, there were somewhere between 2 and 7 million natives living in what is now the U.S. and Canada, with most of that population inhabiting the U.S. Splitting the difference, we can say there were 4.5 million native language speakers pre-conquest but only 454,515 today. That’s a nearly 90% reduction in native language speakers over the course of 500 years.

(Note: this is not the same as a reduction in population. There are currently 2.9 million native Americans in the U.S., which, depending on your source, represents anywhere from a net gain in population between the 15th and 21st centuries to a loss of around 50-60% of the total native population. The comparatively drastic loss in the number of native language speakers, however, results from the fact that most native Americans have, both recently and historically, switched to English.)

Speaking of languages, then, not population, it seems as though total annihilation is the most probable outcome for a language after conquest. It seems almost inevitable that a conquered population will eventually adopt the language of its conqueror, or at least see its own language transformed beyond recognition. (This is why only 100,000 people speak Irish in Ireland, and why no one today speaks an English untouched by Norman French.)

Thus, it’s not surprising that most native languages possess fewer than 1,000 speakers, or that more than half only have between 1 and 100 speakers—i.e., it’s not surprising that more than half of native American languages are practically extinct. If we ignore the nine ‘healthiest’ native languages (the outliers with more than 10,000 speakers), then the total reduction in native language speakers between pre-colonial times and today rapidly approaches 100%.

Which returns us to the interesting thing about this data: the existence of these (comparatively) healthy native American languages. The nine healthiest languages have a total of 368,259 speakers, which translates to 81% of all native language speakers across all tribal languages; and yet these nine languages comprise only about 7% of the living native languages on my list. In other words, 81% of native language speakers in America and parts of Canada speak only 7% of the existing native languages (less than 4% of all native languages, if we factor extinct languages and all Canadian languages into the equation).
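For what it’s worth, the skew is easy to recompute from a compiled list. Here is a minimal sketch, assuming the data sit in a CSV with language and speakers columns; the file name and column names are hypothetical, and the figures in this post come from my own spreadsheet:

```python
# Sketch of the skew computation; the CSV name and columns are hypothetical.
import pandas as pd

df = pd.read_csv("native_language_speakers.csv")   # columns: language, speakers
total = df["speakers"].sum()
for threshold in (1_000, 10_000, 50_000):
    n = (df["speakers"] > threshold).sum()
    print(f"{n} languages with more than {threshold:,} speakers")

top9 = df.nlargest(9, "speakers")
print(f"top nine languages account for {top9['speakers'].sum() / total:.0%} of all speakers")
```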

I imagine that if we look at any area on the globe where conquered indigenous languages jostle beside more powerful indigenous or colonial languages, we’ll find similar data showing that, even amongst the less powerful languages, there remains a very skewed hierarchy of linguistic health. One can’t help wondering what’s at work here . . .

I enjoy compiling large sets of data like this because certain questions just don’t come into sharp focus until we compile the data. I think most rhet/comp scholars, like Cushman, have a general understanding that certain native American languages are in better shape than others; however, until we take the time to work with the actual data set (all living and extinct native American languages), we won’t discover this skewed pattern within it, and we won’t be able to formulate what, to my mind, are highly interesting and relevant questions: why and how have certain languages managed to survive and (comparatively) thrive while most other native languages have gone extinct or dwindled to only a few hundred speakers? What did these languages and tribal groups have going for them that the others didn’t? Was it a purely linguistic advantage, a purely geopolitical advantage, or a combination of both?

In part, we can read Cushman’s book as an answer to these unformulated questions. While Cushman spends a lot of time (rightly) describing language attrition among contemporary Cherokees, she perhaps doesn’t realize that Cherokee is doing a hell of a lot better than most other native languages. Although her book presents something of a contrast between the language’s current weakened state and the syllabary’s historic role in uniting and strengthening the Cherokee against further Western encroachment, we can see, in light of this data, how the contrast is perhaps instead a partial explanation for the fact that Cherokee isn’t as unhealthy as the vast majority of native American languages. In other words, the existence of the Cherokee syllabary may very well be one of the reasons why Cherokee exists on the healthier side of living native languages, why Cherokee isn’t entirely extinct.

Stylizing Sequoyah’s thought process, Cushman writes, “If whites could have a writing system that so benefited them, filling them with self-respect and earning the respect of others, then Cherokees could have a writing system with all this power as well” (35). After compiling statistics on native language health, I can see that Cushman, in focusing on current language attrition among the Cherokee, misses a deeper exploration of a compelling possibility: that the syllabary’s power not only bolstered the Cherokee people but also perhaps played a part in saving the Cherokee language itself from total extinction. The syllabary’s strengthening role was not merely a historical phenomenon; without it, perhaps there wouldn’t be a Cherokee language today at all.

This is a good example of why I think digital tools and databases have a lot to offer the humanities: without them, patterns go unnoticed and questions go unasked.

Methodological notes: I couldn’t rank linguistic health among native languages without first deciding what “counted” as a native language and what was simply a dialect of a language. This language/dialect issue is sometimes difficult to navigate, and Ethnologue typically gives each dialect its own language code. But such granularity is misleading; Madrid Spanish and Buenos Aires Spanish are different in many respects, but speakers in both places can understand one another because they are still, despite the differences, speaking Spanish.

Mutual intelligibility between speaker populations is the general rule for differentiating a dialect from a separate language, and I’ve done my best to follow that rule. For example, I’ve counted Ojibwe as a single language, even though Ojibwe is in fact a continuum of dialects; on the other hand, I’ve divided the Miwok continuum into different languages (Sierra Miwok, Plains Miwok, et cetera). Speakers of the Miwok languages, while closely related, have difficulty understanding one another in a way that speakers of Ojibwe dialects do not. So, Ojibwe is a single language, while the Miwok ‘dialects’ should really be considered separate languages.

However, none of these decisions made a huge difference in the rankings. Some might quibble with my grouping of all Ojibwe or Cree dialects into a single language, but even had I taken out the dialects that aren’t perfectly intelligible with the others, each of these languages still would have retained tens of thousands of speakers. Conversely, even had I counted all Miwok speakers as a single linguistic group, Miwok would still have fewer than 50 speakers.

Finally, when compiling statistics on numbers of speakers for each language, I used field linguists’ counts when they were available, rather than census counts, which tend to err on the side of liberality. (E.g., according to the U.S. census, there are over 150,000 Navajo speakers, but most linguists consider this an unlikely number.)

Meanings of ‘writing’ and ‘rhetoric’ in RSQ and CCC

Earlier this year, I compiled two small corpora of article abstracts from the most prominent journals in the American fields of rhetoric and writing studies: Rhetoric Society Quarterly and College Composition and Communication, respectively. The RSQ abstracts stretch from Winter 2000 (30.1) to Fall 2011 (41.5), for a total of 220 abstracts. The CCC abstracts stretch from February 2000 (51.3) to September 2011 (63.1), for a total of 261 abstracts. I think that article abstracts are a good vantage point for looking at disciplinary trends, because (in the humanities, anyway) researchers tend to write abstracts that function like movie previews. Designed to appeal to a specific disciplinary audience, abstracts signal that their articles ‘belong’ in the field by using all the right buzz words, name-dropping all the right researchers, and making all the right stylistic moves that make other researchers want to read the article.

Using Python and the Natural Language Toolkit to explore these two corpora of abstracts, I’ve discovered both interesting and unsurprising things about how rhetoric and writing studies have taken shape, over the last decade, as separate but ambivalently related disciplines. One of the more interesting pieces of capta demonstrated by the corpora is that the words ‘writing’ and ‘rhetoric’ share grammatical contexts with very different lexical items, suggesting that each word means something different in each journal.

Before I get to the details, though, here’s a bit about my methodology:

With Python and NLTK, you can chart how a word is used similarly or differently in two corpora. For instance, a concordance of the word ‘monstrous’ in Moby Dick reveals contexts such as ‘the monstrous size’ and ‘the monstrous pictures’. Running a few extra commands, you discover that words such as ‘impalpable’, ‘imperial’, and ‘lamentable’ are also used in these same contexts. Running an identical search on Sense and Sensibility, however, reveals that ‘monstrous’ shares contexts with quite different terms: ‘very’, ‘exceedingly’, and ‘remarkably’. Dissimilar contexts reveal different connotations for ‘monstrous’ in each novel, positive or neutral in Austen but negative in Melville. This, basically, was the method I applied for mining the usage of ‘rhetoric’ and ‘writing’ in the abstracts corpora (more details below the tables, though).
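This is, more or less, the opening example from the NLTK book, and it takes only a few commands (after a one-time nltk.download('book')):

```python
# The Moby Dick / Sense and Sensibility comparison described above.
from nltk.book import text1, text2   # text1 = Moby Dick, text2 = Sense and Sensibility

text1.concordance("monstrous")                  # contexts like "the monstrous size"
text1.similar("monstrous")                      # words sharing those contexts in Melville
text2.similar("monstrous")                      # very different company in Austen
text2.common_contexts(["monstrous", "very"])    # contexts the two words share
```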

‘Rhetoric’ occurred 244 times in RSQ abstracts and 69 times in CCC abstracts. ‘Writing’ occurred 22 times in RSQ and 251 times in CCC. I compiled common grammatical contexts for each term in each corpus. Each context took the form,

N  x  N

where N was any term and x was ‘rhetoric’ or ‘writing’, respectively.

[Table: nltkblog, common grammatical contexts for ‘rhetoric’ and ‘writing’ in each corpus]

The more contexts two terms share, the more likely it is that the two terms, within the specific corpus, are used interchangeably. One way to get your head around this fact is to look at a grammatical context without an operative term:

(1) I _ you

In an English corpus, the words that can appear in that _ slot are limited. Hundreds, if not thousands, of words will indeed fit the context, but even with such a large list of lexical items, all of the items will nonetheless share some distinguishing grammatical and semantic value: for example, every word that can appear in context (1) must be a transitive or di-transitive verb, and none of them can be a third-person singular present form. Right off the bat, this context has limited its possible terms to a fraction of all the words in the English lexicogrammar. Throw in a second context, and the list of terms grows even smaller:

(2) is _ by

Given the rules of English morphology and syntax, most of the words that can appear in this context will be past participles of action or emotion verbs (e.g., loved, felt, killed, written, eaten, trapped). Terms that can appear in both (1) and (2) are therefore quite limited: only transitive or di-transitive verbs, no third-person singular present forms, and now no irregular verbs whose past tense and past participle differ (e.g., written/wrote, eaten/ate): ‘I wrote you’ works, but ‘is wrote by’ does not.

If we start using contexts that contain more than just semantically null stopwords on both sides, it’s easy to see how the list of terms can grow very short very quickly:

(3) I _ girls

What kinds of words can appear in (1), (2), and (3)? No irregular verbs of the written/wrote type, no third-person singular present forms, and now probably no di-transitive verbs, given the lack of a determiner before ‘girls’ (compare ‘I put the girls to bed’). Words that can appear in all three of these contexts are likely to be words that group together in some meaningful way.

So, when a corpus analysis tells us that two words share half a dozen or more contexts in a specific corpus, you can see how these words might share not only grammatical but also semantic and definitional attributes within that corpus. The simple example of ‘monstrous’, ‘lamentable’, and ‘imperial’ in Melville demonstrates the point. It is also borne out by the large number of contexts (20!) shared by ‘writing’ and ‘composition’ in the CCC abstracts, two words that I knew, a priori, were synonymous in the American field of writing studies. The analysis confirms this a priori knowledge, and thereby lends confidence to the methodology.
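Here is the gist of that shared-context count in plain Python: a sketch of the general idea rather than the exact commands I ran, with a hypothetical file name standing in for the abstracts corpus:

```python
# Collect every (left word, right word) context for each token, then intersect
# the context sets of two target words. The file name is hypothetical.
import re
from collections import defaultdict

def contexts(path):
    tokens = re.findall(r"[a-z']+", open(path).read().lower())
    ctx = defaultdict(set)
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        ctx[word].add((left, right))
    return ctx

ccc = contexts("ccc_abstracts.txt")
shared = ccc["writing"] & ccc["rhetoric"]
print(len(shared), "contexts shared by 'writing' and 'rhetoric' in CCC")
for left, right in sorted(shared):
    print(f"  {left} _ {right}")
```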

[Table: nltk14, terms sharing contexts with ‘rhetoric’ and ‘writing’ in each journal, overlapping terms highlighted]

While the terms sharing 2 or 3 contexts in the tables above are interesting, our attention should be focused on the terms near the top of the lists. In RSQ, ‘language’, ‘discourse’, ‘art’, ‘persuasion’, ‘theory’, and ‘texts’ tell us indirectly what the word ‘rhetoric’ means in that journal; in CCC, ‘writing’, ‘composition’, ‘education’, ‘place’, and ‘theory’ provide the same information.

The highlighted terms are the terms that overlap between each journal’s set of common contexts. The overlap is minimal. For ‘rhetoric’, only a single word (‘theory’) overlaps and surfaces in more than 3 distinct contexts in each journal; for ‘writing’, no word overlaps and surfaces in more than 3 distinct contexts. More telling is that ‘writing’ and ‘rhetoric’ themselves possess a high degree of interchangeability in CCC, sharing 7 distinct contexts, but a very low degree in RSQ, where they share only 2 distinct contexts. In other words, these capta suggest that ‘writing’ and ‘rhetoric’ mean nearly the same thing in CCC but do not mean the same thing at all in RSQ.

Transnational Literate Lives in Digital Times

What is Transnational Literate Lives in Digital Times? Is it an electronic book? A website? An interactive multimedia experience? A scholarly monograph? An ethnography? A report on research findings? And who wrote it? Hawisher, Selfe, and Berry? Or the 13 students whose videos and stories form the backbone of the project?

There are many questions we can ask about this remarkable piece of digital work. To answer the first round of questions, I’d say, “All of the above.” TLLDT is equal parts scholarly book, website, and ethnography. In answer to the second round of questions, I’d again say, “All of the above.” While Hawisher, Selfe, and Berry clearly wove the student work into a cohesive and theoretically informed whole, their project would be impossible without the involvement of the students.

I’ve spent the whole semester thus far keeping a critical eye on the texts we’ve read, pouncing on problems and complicating the solutions. However, I think I’ll switch off my critical spotlight for today, and instead write a simple list of what I liked about this digital text:

1. It’s actually transnational. The word gets thrown around a lot, but this text makes good on its title—with emphasis on the national. Too often, the national gets lost in favor of overly general notions of race or ethnicity. I’m too lazy to look for examples, but I’m sure it would be easy to find ones in which scholars talk about “Asians” after studying a group of South Korean students, or in which scholars refer to “European” as though Germans, French, and Spaniards were all one and the same. (To me, a right-wing pundit who points to genocide in Sudan and cries, “African problem!” is as absurd as a left-wing pundit who points to Mexican immigrants’ struggles and cries, “Latinos are being oppressed!”) TLLDT does not make this mistake. It doesn’t feature “Latino” students . . . it features a Mexican and a Peruvian. It doesn’t write about “Asians” . . . it allows a Chinese, a South Korean, and an American to write about themselves. TLLDT even carves a space for two Bosnian students . . . praise the gods, I can’t remember the last time I saw Bosnia represented in this field (perhaps it, too, is just “Europe”?).

2. From the text: “Because participants’ voices have been such an important part of these literacy narratives, we have also tried to maintain, as far as possible, the integrity of the responses: at times using digital video to record their narratives and writing processes and at other times using their own written words and language. In addition, we have selected written passages that retain the participants’ words and phrasing, grammatical structures, and distinctive word choices, which also mark their digital videos. This approach, we believe, keeps the contributors’ language intact—along with all of its important markers of class, age, geography, and personal expression.” TLLDT does a wonderful job of actually letting the subjects speak for themselves. The format is superb. I’ve always been mildly put off by the way certain books interweave “literacy narrative” with “theory”, moving from a paragraph of abridged transcripts to a paragraph of theoretical claims or disciplinary positioning. I much preferred the way in which TLLDT delivers the theoretical, contextual, and disciplinary concerns in introductions, allowing the student voices, for the most part, to take over during the meat-and-potatoes of the chapters themselves. I did not feel like the authors were putting words into the subjects’ mouths or that the authors were interrupting the flow of someone else’s narrative to make a point.

3. Related to the point above: the format of the digital text also emphasized the movement between the global and the local. The final chapter on the “digital divide” undertook this movement especially well. The larger, statistical background was represented in the introduction: the global. Then the individual students gave their point of view as “nodes” operating in the wider system: the local. (However, even within the individual chapters, there is fluid movement between presentations of larger trends and individual action.) Political turmoil in Nigeria, and yet here is a young man who still managed to play computer games and learn the value of auto-correct in Word. Pro-democracy protests in China, and here is a sister smuggling Crouching Tiger Hidden Dragon into her home, where her little brother devours it.

4. Pengfei Song’s story was particularly moving. He was born to an illiterate mother, grew up in a hut in rural China, attended a mud schoolhouse, and didn’t start learning English until his pre-teens . . . and yet he has managed, quite obviously, to attain an admirable level of fluency in standard written English. It makes me wonder about some of my American students in California, who didn’t write half as well as Song. What was their excuse again . . . ? My favorite line came from Song’s chapter, as well: when asked if his school was segregated by religion, or race, or gender, Pengfei noted, “No. The teachers beat the girls almost in the same way as the boys.” That’s a great line for a movie if I ever saw one . . .