Lying with Data Visualizations: Is it Misleading to Truncate the Y-Axis?

Making the rounds on Twitter today is a post by Ravi Parikh entitled “How to lie with data visualization.” It falls neatly into the “how to lie with statistics” genre because data visualization is nothing more than the visual representation of numerical information.

At least one graph provided by Parikh does seem like a deliberate attempt to obfuscate information–i.e., to lie:

[Image: y-axis2]

Inverting the y-axis so that zero starts at the top is very bad form, as Parikh rightly notes. It is especially bad form given that this graph delivers information about a politically sensitive subject (firearm homicides before and after the enacting of Stand Your Ground legislation).

Other graphs Parikh provides don’t seem like deliberate obfuscations so much as exercises in stupidity:

[Image: y-axis3]

Pie charts whose divisions are labeled with percentages need to add up to 100%. Apparently no one in Fox Chicago’s newsroom knows how to add. WTF Visualizations—a great site—provides many examples of pie charts like this one.

So, yes, data visualizations can be deliberately misleading; they can be carelessly designed and therefore uninformative. These are problems with visualization proper, and may or may not reflect problems with the numerical data itself or the methods used to collect the data.

However, one of Parikh’s “visual lies” is more complicated: the truncated y-axis:

[Image: y-axis1]

About these graphs, Parikh writes the following:

One of the easiest ways to misrepresent your data is by messing with the y-axis of a bar graph, line graph, or scatter plot. In most cases, the y-axis ranges from 0 to a maximum value that encompasses the range of the data. However, sometimes we change the range to better highlight the differences. Taken to an extreme, this technique can make differences in data seem much larger than they are.

Truncating the y-axis “can make differences in data seem much larger than they are.” Whether differences in data are large or small, however, depends entirely on the context of the data. We can’t know, one way or the other, whether a difference of 0.001% is major or insignificant unless we have some knowledge of the field for which that statistic was compiled.

Take the Bush Tax Cut graph above. This graph visualizes a tax increase for those in the top bracket, from a 35% rate to a 39.6% rate. This difference is put into a graph with a y-axis that extends from 34–42%, which makes the difference seem quite significant. However, if we put this difference into a graph with a y-axis that extends from 0–40%—the range of income tax rates—the difference seems much less significant:

[Image: y-axis4]
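For the curious, here is a minimal matplotlib sketch (my own toy reconstruction, not Parikh’s figures or the graphs reproduced above; the bar labels are hypothetical) showing how the same two rates read under the two axis choices:

```python
# A toy reconstruction: the same two tax rates under a truncated and a
# zero-based y-axis. Labels and styling are invented for illustration.
import matplotlib.pyplot as plt

labels = ["Now", "Jan 1, 2013"]   # hypothetical labels
rates = [35.0, 39.6]              # top marginal income tax rates (%)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))

ax1.bar(labels, rates)
ax1.set_ylim(34, 42)              # truncated axis: the 4.6-point gap looks dramatic
ax1.set_title("Truncated y-axis (34 to 42%)")
ax1.set_ylabel("Top tax rate (%)")

ax2.bar(labels, rates)
ax2.set_ylim(0, 40)               # zero-based axis: the same gap looks modest
ax2.set_title("Zero-based y-axis (0 to 40%)")

plt.tight_layout()
plt.show()
```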

So which graph is more accurate? The one with a truncated y-axis or the one without it? The one in which the percentage difference seems significant or the one in which it seems insignificant?

Here’s where context-specific knowledge becomes vital. What is actually being measured here? Taxes on income. Is a 39.6% tax on income really that much greater than a 35% tax? According to the current U.S. tax code, this highest bracket affects individual earnings over $400,000/year and, for married couples, earnings over $450,000/year. Let’s go with the single rate. Let’s say someone makes $800,000 per year in income, meaning that $400,000 of that income will be taxed at the highest rate:

35% of 400,000 = 0.35(400,000) = 140,000

39.6% of 400,000 = 0.396(400,000) = 158,400

158,400 – 140,000 = 18,400
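(The same arithmetic as a tiny script, in case you want to plug in other incomes; the function name and the $400,000 threshold simply restate the assumptions above.)

```python
# Back-of-the-envelope cost of the rate change: only income above the
# bracket threshold is taxed at the top rate, so only that slice matters.
def top_bracket_difference(income, threshold=400_000, old_rate=0.35, new_rate=0.396):
    taxed_at_top = max(0, income - threshold)
    return (new_rate - old_rate) * taxed_at_top

print(round(top_bracket_difference(800_000)))    # 18400
print(round(top_bracket_difference(1_400_000)))  # 46000
```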

So, in real numbers, not percentages, the tax hike will cost someone making $800k each year an extra $18,400. It would cost even more for those earning over a million. So, the question posed a moment ago (which graph is more accurate?) can also be posed in the following way: is an extra eighteen grand lost annually to taxes a significant or insignificant amount?

And this of course is a subjective question. Ravi Parikh thinks it’s not a significant difference, which is why he used the truncated graph as an example in a post titled “How to lie with data visualization.” (And as a graduate student, my response is also, “Boo-freaking-hoo.”) However, imagine a wealthy couple, owners of a successful car dealership, being taxed at this rate (based on a combined income of ~800k). They have four kids. Over 18 years, the money lost to this tax raise will equal what could have been college tuition for two of their kids. I believe they would think the difference between 35% and 39.6% is significant. (Note that the “semi-rich” favor Republicans, while the super rich, the 1%, favor Democrats.)

What about the baseball graph? It shows a pitcher’s average knuckleball speed from one year to the next. When measuring pitch speed, how significant is the difference between 77.3 mph and 75.3 mph? Is the truncated y-axis making a minor change more significant than it really is? As averages across an entire season, a drop of 2 mph does seem pretty significant to me. If Dickey were a fastball pitcher, a drop from an average of 92 mph to 90 mph would mean many more pitches under 90 mph, which could lead to a higher ERA, fewer starts, and a truncated career. For young pitchers being scouted, the difference between an 84 mph pitch and an 86 mph pitch can apparently mean the difference between getting signed and not getting signed. Granted, there are very few knuckleballers in baseball, so whether or not this average difference is significant in the context of the knuckleball is difficult to ascertain. However, in the context of baseball more generally, a 2 mph average decline in pitch speed is worth visualizing as a notable decline.

So, do truncated y-axes qualify as the same sort of data-viz problem as pie charts that don’t add up to 100%? It depends on the context. And there are plenty of contexts in which tiny differences are in fact significant. In these contexts, not truncating the y-axis would mean creating a misleading visualization.

Distant Reading and the “Evolution” Metaphor

1

Are there any corpora that purposefully avoid “diachronicity”? There are corpora that possess no meta-data about publication dates and whose texts are therefore organized by some other scheme—for example, the IMDB movie review corpus, which is organized according to positive/negative polarity; its texts, as far as I know, are not arranged chronologically or coded for time in any way. And there are cases where time-related data are not available, easily or at all. But have any corpora been compiled with dates—the time element—purposefully elided? Is time ever left out of a corpus because that information might be considered “noise” to researchers?

Maybe in rare situations. But for most corpora whose texts span any length of time greater than a year, the texts are, if possible, arranged chronologically or somehow tagged with date information. In this universe, time flows in one direction, so assembling hundreds or thousands of texts with meta-data related to their dates of publication means the resulting corpus will possess an inherent diachronicity whether we want it to or not. We can re-arrange the corpus for machine-learning purposes, but the “time stamp” is always there, ready to be explored. Who wouldn’t want to explore it?

If we have a lot of texts—any data, really—that span a great length of time, and if we look at features in those data across the time span, what do we end up studying? In nearly all cases, we end up studying patterns of formal change and transformation across spans of time. The “evolution” metaphor suggests itself immediately. Be honest, now, you were thinking about it the minute you compiled the corpus.

One can, of course, use “evolution” as a general synonym for change. This is probably the case for Thomas Miller’s The Evolution of College English and for many other studies whose data extend only to a limited number of representative sources. However, when it comes to distant readings, the word becomes much more tempting. The trees of Moretti’s Graphs, Maps, Trees are explicitly evolutionary:

For Darwin, ‘divergence of character’ interacts throughout history with ‘natural selection and extinction’: as variations grow apart from each other, selection intervenes, allowing only a few to survive. In a seminar a few years ago, I addressed the analogous problem of literary survival, using as a test case the early stages of British detective fiction . . . (70-71)

The same book ends with an afterword by geneticist Alberto Piazza (who worked with Luigi Luca Cavalli-Sforza on The History and Geography of Human Genes). Piazza writes:

[Moretti's writings] struck me by their ambition to tell the ‘stories’ of literary structures, or the evolution over time and space of cultural traits considered not in their singularity, but their complexity. An evolution, in other words, ‘viewed from afar’, analogous at least in certain respects to that which I have taught and practiced in my study of genetics. (95)

Analogous at least in certain respects . . . For Moretti and Piazza, literary evolution is not just a synonym for change in literature. Biological evolution becomes a guiding metaphor (not perfect, by any means) for the processes of formal change analyzed by Moretti. Piazza continues:

The student of biological evolution is especially interested in the root of a [phylogenetic] tree (the time it originated). . . . The student of literary evolution, on the other hand, is interested not so much in the root of the tree (because it is situated in a known historical epoch) as in its trajectory, or metamorphoses. This is an interest much closer to the study of the evolution of a gene, the particular nature of whose mutations, and the filter operated by natural selection, one wants to understand . . . (112-113)

Obviously, for Piazza, Moretti’s study of changes to and migrations of literary form in time and space evokes the processes and mechanisms of biological evolution—there’s not a one-to-one correspondence, of course, and Piazza points this out at length, but the similarities are evocative enough that he, a population geneticist, felt confident publishing his thoughts on the subject.

In Distant Reading, Moretti has more recently acknowledged that the intense data collection and quantitative analysis that has marked work at Stanford’s Literary Lab must at some point heed “the need for a theoretical framework” (122). Regarding that framework, he writes:

The results of the [quantitative] exploration are finally beginning to settle, and the un-theoretical interlude is ending; in fact, a desire for a general theory of the new literary archive is slowly emerging in the world of digital humanities. It is on this new empirical terrain that the next encounter of evolutionary theory and historical materialism is likely to take place. (122)

In Macroanalysis, Matthew Jockers also acknowledges (and resists) the temptation to initiate an encounter between evolutionary theory and the quantitative, diachronic data compiled in his book:

. . . the presence of recurring themes and recurring habits of style inevitably leads us to ask the more difficult questions about influence and about whether these are links in a systematic chain or just arbitrary, coincidental anomalies in a disorganized and chaotic world of authorial creativity, intertextuality, and bidirectional dialogics . . .

“Evolution” leaps to mind as a possible explanation. Information and ideas do behave in ways that seem evolutionary. Nevertheless, I prefer to avoid the word evolution: books are not organisms; they do not breed. The metaphor for this process breaks down quickly, and so I do better to insert myself into the safer, though perhaps more complex, tradition of literary “influence” . . . (155)

And in the last chapter of Why Literary Periods Mattered, Ted Underwood does not mention evolution at all, but there is clearly an evolutionary connotation to the terms he uses to describe digital humanities’ influence on literary scholars’ conception of history:

. . . digital and quantitative methods are a valuable addition to literary study . . . because their ability to represent gradual, macroscopic change brings a healthy theoretical diversity to literary historicism . . .

. . . we need to let quantitative methods do what they do best: map broad patterns and trace gradients of change. (159, 170)

Underwood also discusses “trac[ing] processes of change” (160) and “causal continuity” (161). The entire thrust of Underwood’s argument, in fact, is that distant or quantitative readings of literature will force scholars to stop reading literary history as a series of discrete periods or sharp cultural “turns” and to view it instead as a process of gradual change in response to extra-literary forces—”Romanticism” didn’t just become “Naturalism” any more than homo erectus one decade decided to become homo sapiens.

Tracing processes of gradual, macroscopic change . . . if that doesn’t invoke evolutionary theory, I don’t know what does. Underwood doesn’t even need to use the word.

Moretti, Jockers, and Underwood are three big names in digital humanities who have recognized, either explicitly or implicitly, that distant reading puts us face to face with cultural transformation on a large, diachronic scale. Anyone working with DH methods has likely recognized the same thing. Like I said, be honest: you were already thinking about this before you learned to topic model or use the NLTK.

 

2

Human culture changes—its artifacts, its forms. This is not up for debate. Even if we think human history is a series of variations on a theme, the mutability of cultural form remains undeniable, even more undeniable than the mutability of biological form. Distant reading, done cautiously, gives us a macro-scale, quantitative view of that change, a view simply not possible to achieve at the scale of individual texts or artifacts. Given the fact of cultural transformation, then, and DH’s potential to visualize it, to quantify aspects of it, one of two positions must be taken.

1. The diachronic patterns we discover in our distant readings are, to use Jockers’ words, “just arbitrary, coincidental anomalies in a disorganized and chaotic world of authorial creativity, intertextuality, and bidirectional dialogics.” Theorizing the patterns is a fool’s errand.

2. The diachronic patterns we discover are not arbitrary or random. Theorizing the patterns is a worthwhile activity.

Either we believe that there are processes guiding cultural change (or, at least, that it’s worthwhile to discover whether or not there are such processes) or we assume a priori that no such processes exist. (A third position, I suppose, is to believe that such processes exist but we can never know them because they are too complex.) We can all decide differently. But those who adopt the first position should kindly leave the others to their work. In my view, certain criticisms of distant reading amount to an admonition that “What you’re trying to do just can’t be done.” We’ll see.

 

3

When we decide to theorize data from distant readings, what are we theorizing? Moretti, Jockers, and Underwood each provide a similar answer: we are theorizing changes to a cultural form over time and, in some instances, space. Certain questions present themselves immediately: Are the changes novel and divergent, or are they repeating and reticulating? Is the change continuous and gradual, or are there moments of punctuated equilibrium? How do we determine causation? Are purely internal mechanisms at work, or also external dynamics? A complex interplay of both internal mechanisms and external dynamics? How do we reduce data further or add layers of them to untangle the vectors of causation?

To me, all of this sounds purely evolutionary. Even talking about gradual vs. quick change is a discussion taken right out of Darwinian theory.

But we needn’t adopt the metaphor explicitly if we are troubled that it breaks down at certain points. Alex Reid writes:

Matthew Jockers remarks following his own digital-humanistic investigation, “Evolution is the word I am drawn to, and it is a word that I must ultimately eschew. Although my little corpus appears to behave in an evolutionary manner, surely it cannot be as flawlessly rule bound and elegant as evolution” (171). As he notes elsewhere, evolution is a limited metaphor for literary production because “books are not organisms; they do not breed.” He turns instead to the more familiar concept of “influence” . . . Certainly there is no reason to expect that books would “breed” in the same way biological organisms do (even though those organisms reproduce via a rich variety of means). [However], if literary production were imagined to be undertaken through a network of compositional and cognitive agents, then such productions would not be limited to the capacity of a human to be influenced. Jockers may be right that “evolution” is not the most felicitous term, primarily because of its connection to biological reproduction, but an evolutionary-type process, a process as “natural” as it is “cultural,” as “nonhuman” as it is “human,” may exist.

An “evolutionary-type” process of culture is what we’re after, one that is not necessarily reliant on human agency alone. Will it end up being “flawlessly rule bound and elegant as evolution”? First, I think Jockers seriously over-estimates the “flawless” nature of evolutionary theory and population genetics. If the theory of evolution is so flawless and elegant, and all the science settled, what do biologists and geneticists do all day? Here’s a recent statement from the NSF:

Understanding the tree of life has been a goal of evolutionary biologists since the time of Darwin. During the past decade, unprecedented gains in gathering and analyzing phylogenetic data have demonstrated increasingly complex genealogical patterns.

. . . . Our current knowledge of processes such as hybridization, endosymbiosis and lateral gene transfer makes clear that the evolutionary history of life on Earth cannot accurately be depicted as a single, typological, bifurcating tree.

Moretti, it turns out, needn’t worry so much about the fact that cultural evolution reticulates. And Jockers needn’t assume that biological evolution is elegantly settled stuff.

Secondly, as Reid argues, we needn’t hope to discover a system of influence and cultural change that can be reduced to equations. We probably won’t find any such thing. However, within all the textual data, we can optimistically hope to find regularities, patterns that can be used to make predictions about what might be found elsewhere, patterns that might connect without casuistic contrivance to theories from the sciences. Here’s an example, one I’ve used several times on this blog: Derek Mueller’s distant reading of the journal College Composition and Communication. Mueller used article citations as his object of analysis. When he counted and graphed a quarter century of citations in the journal, he discovered patterns that looked like this:

[Image: muellerlongtail]

Actually, based on similar studies of academic citation patterns, we could have predicted that Mueller would discover this power law distribution. It turns out that academic citations—a purely cultural form, a textual artifact constructed through the practices of the academy—behave according to a statistical law that seems to affect all sorts of things, from earthquakes to word frequencies. This example makes a strong case against those who argue that cultural artifacts, constructed by human agents within their contextualized interactions, will not aggregate over time into scientifically recognizable patterns.  Granted, this example comes from mathematics, not evolutionary theory, but it makes the point nicely anyway: the creations of human culture are not necessarily free from non-human processes. Is it foolish to look for the effects of these processes through distant reading?
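(As a side note, here is roughly how such a distribution gets eyeballed; the counts below are invented, not Mueller’s data. Sort the citation tallies, then plot rank against frequency on log-log axes: a power-law-like long tail shows up as a roughly straight, downward-sloping line.)

```python
# Rank-frequency plot with made-up citation counts (not Mueller's data):
# a few heavily cited sources, many sources cited exactly once.
import matplotlib.pyplot as plt

counts = sorted([120, 75, 40, 22, 15, 9, 6, 4, 3, 2, 2] + [1] * 200, reverse=True)
ranks = range(1, len(counts) + 1)

plt.loglog(ranks, counts, marker=".", linestyle="none")
plt.xlabel("Rank of cited source")
plt.ylabel("Number of citations")
plt.title("Rank-frequency plot: power-law-like tails look roughly linear")
plt.show()
```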

 

4

“Evolution,” “influence,” “gradualism”—whatever we call it in the digital humanities, those of us adopting it on the literary and rhetorical end have a huge advantage over those working in history: we have a well-defined, observable element, an analogue of DNA, to which we can always reduce our objects of study: words. If evolution is going to be a guiding metaphor, we need this observable element because it is through observations of its metamorphoses (in usage, frequency, etc.) that we begin to figure out the mechanisms and dynamics that actually cause or influence those metamorphoses. If we had no well-defined segment to observe and quantify, the evolutionary metaphor could be thrown right out.

To demonstrate its importance, allow me a rhetorical experiment. First, I’ll write out Piazza’s description of biological evolution found in his afterword to Graphs, Maps, Trees. Then, I’ll reproduce the passage, substituting lexical and rhetorical terms for “genes” but leaving everything else more or less the same. Let’s see how it turns out:

Recognizing the role biological variability plays in the reconstruction of the memory of our (biological) past requires ways to visualize and elaborate data at our disposal on a geographical basis. To this end, let us consider a gene (a segment of DNA possessed of a specific, ascertainable biological function); and for each gene let us analyze its identifiable variants, or alleles. The percentage of individuals who carry a given allele may vary (very widely) from one geographical locality to another. If we can verify the presence or absence of that allele in a sufficient number of individuals living in a circumscribed and uniform geographical area, we can draw maps whose isolines will join all the points with the same proportion of alleles.

The geographical distribution of such genetic frequencies can yield indications and instruments of measurement of the greatest interest for the study of the evolutionary mechanisms that generate genetic differences between human populations. But their interpretation involves quite complex problems. When two human populations are genetically similar, the resemblance may be the result of a common historical origin, but it can also be due to their settlement in similar physical (for example, climatic) environments. Nor should we forget that styles of life and cultural attitudes of an analogous nature (for example, dietary regimes) can favour the increase or decrease to the point of extinction of certain genes.

Why do genes (and hence their frequencies) vary over time and space? They do so because the DNA sequences of which they are composed can change by accident. Such change, or mutations, occurs very rarely, and when it happens, it persists equally rarely in a given population in the long run . . . From an evolutionary point of view, the mechanism of mutation is very important because it introduces innovations . . .

. . . The evolutionary mechanism capable of changing the genetic structure of a population most swiftly is natural selection, which favours the genetic types best adapted for survival to sexual maturity, or with a higher fertility. Natural selection, whose action is continuous over time, having to eliminate mutations that are injurious in a given habitat, is the mechanism that adapts a population to the environment that surrounds it. (100-101)

Now for the “distant reading” version:

Recognizing the role lexical variability plays in the reconstruction of the memory of our (literary and rhetorical) past requires ways to visualize and elaborate data at our disposal on the basis of cultural space (which often correlates with geography). To this end, let us consider a word (a segment of phonemes and morphemes possessed of a specific, ascertainable grammatical or semantic function); and for each word let us analyze its stylistic variants, or synonyms. The percentage of texts that carry a given stylistic variant may vary from one cultural space to another, or from one genre to the other. If we can verify the presence or absence of that variant in a sufficient number of texts produced in a circumscribed and uniform cultural space we can draw maps whose isolines will join all the points with the same proportion of stylistic variants.

The distribution of such lexical frequencies can yield indications and instruments of measurement of the greatest interest for the study of the evolutionary mechanisms that generate lexical differences between “generic populations.” But their interpretation involves quite complex problems. When two rhetorical forms or genres are lexically similar, the resemblance may be the result of a common historical origin, but it can also be due to their development in similar geographic or political environments. Nor should we forget that styles of life and cultural attitudes of an analogous nature (for example, religious dictates) can favour the increase or decrease to the point of extinction of certain lexical items or clusters of lexical items.

Why do words (and hence their frequencies and “clusterings”) vary over time and space? They do so because of stylistic innovations. Such innovation occurs very rarely, and when it happens, it persists equally rarely in a given generic population in the long run . . . From an evolutionary point of view, the mechanism of innovation is very important because it introduces new rhetorical forms . . .

. . . The evolutionary mechanism capable of changing the lexical structure of a rhetorical form or genre most swiftly is cultural selection, which favours the forms best adapted for survival to publication and circulation, or with a higher degree of influence (meaning a higher likelihood of being reproduced by others without too many changes). Cultural selection, whose action is continuous over time, having to eliminate rhetorical innovations or “mutations” that are injurious in a given cultural habitat, is the mechanism that adapts a rhetorical form to the environment that surrounds it.

Obviously, it’s not perfect. I leave it to the reader to decide its persuasive potential.

I think the biggest problem is in the handling of mutations. In biological evolution, genes mutate via chance variations during replication of their segments; these mutations can introduce innovations in an organism’s form or function. In literary evolution, however, no sharp distinction exists between a lower-scale “mutation” and the innovation it introduces. The innovation is the formal mutation. This issue arises because, in literary evolution, as in linguistic evolution, the genotype/phenotype distinction is not as obvious or strictly scaled as it is in evolutionary theory. Words are more phenotype than genotype, unless we want to get lost in an overly complex evocation of morphology and phonology.

The metaphor always breaks down somewhere, but where it works, it is, I think, highly suggestive: the idea is that we track rhetorical forms—constellations of words and their stylistic variants—across time and space, in order to see where the forms replicate and where they disappear. Attach meta-data to the texts that constitute those forms, and we will have what it takes to begin making data-driven arguments about how cultural ecology affects or does not affect cultural form.

It’s an interesting framework in which distant reading might go forward, even if explicit uses of the word “evolution” are abandoned.

Historical Linguistics and Population Genetics

Reich et al. provide a model of two ancient populations in India that are ancestral to modern populations—Ancestral North Indians (ANI) and Ancestral South Indians (ASI). According to Reich et al., ANI is, on average, more genetically similar to Middle Easterners, Central Asians, and Europeans. ASI, on the other hand, is distinct from ANI as well as from East Asian populations. This same study found that “ANI ancestry ranges from 39–71% in most Indian groups, and is higher in traditionally upper caste and Indo-European speakers.” Furthermore, Reich et al. showed that the Indian caste system is old and historically entrenched—high FST values indicate that “strong endogamy must have shaped marriage patterns in India for thousands of years.” This seriously contradicts the claims of Edward Said, Nicholas Dirks, and others who have argued that caste in India was more fluid and less systematized before British imperial rule.

However, a recent paper (Moorjani et al. 2013) does show fluid population admixture between Indian groups somewhere between 1,900 and 4,200 years ago.

Our analysis documents major mixture between populations in India that occurred 1,900 – 4,200 years BP, well after the establishment of agriculture in the subcontinent. We have further shown that groups with unmixed ANI and ASI ancestry were plausibly living in India until this time. This contrasts with the situation today in which all groups in mainland India are admixed. These results are striking in light of the endogamy that has characterized many groups in India since the time of admixture. For example, genetic analysis suggests that the Vysya from Andhra Pradesh have experienced negligible gene flow from neighboring groups in India for an estimated 3,000 years. Thus, India experienced a demographic transformation during this time, shifting from a region where major mixture between groups was common and affected even isolated tribes such as the Palliyar and Bhil to a region in which mixture was rare.

As the researchers go on to indicate, ~2,000 to 3,000 years ago corresponds to the major transitions attendant to the end of the Harappan civilization and the influx of the Indo-Aryans. Can these genetic studies shed any light on the controversies of Indian language history?

Emeneau’s famous 1956 paper, “India as a Linguistic Area,” holds up reasonably well to contemporary scrutiny. The Indo-Aryan, Dravidian, and Munda language families have obviously influenced one another. Dravidian influence on Indo-Aryan is well attested. But this seems odd given the correlation, discovered by Reich et al. and others, between Indo-European speaking ancestry and upper caste status in India. Another population genetics study (Bamshad et al. 2001) puts it this way:

Indo-European-speaking people from West Eurasia entered India from the Northwest and diffused throughout the subcontinent. They purportedly admixed with or displaced indigenous Dravidic-speaking populations. Subsequently they may have established the Hindu caste system and placed themselves primarily in castes of higher rank.

These “Indo-European-speaking people” probably have something to do with Reich et al.’s Ancestral North Indians. But if these “invaders” were strong enough to admix with and displace the indigenous Dravidic-speaking populations, why does Emeneau find Dravidian influence on Indo-Aryan? Imagine Cherokee influencing English on the scale of 5%. It’s just not going to happen. Most linguistic history shows that dominant languages influence less dominant languages; the opposite rarely occurs, and if it does, its influence on the dominant language is minimal.  In another paper, Emeneau has this to say:

[There has long been the assumption] that the Sanskrit-speaking invaders of Northwest India were people of a high, or better, a virile, culture, who found in India only culturally feeble barbarians, and that consequently the borrowings that patently took place from Sanskrit and later Indo-Aryan languages into Dravidian were necessarily the only borrowings that could have occurred . . . It was but natural to operate with the hidden, but anachronistic, assumption that the earliest speakers of Indo-European languages were like the classical Greeks or Romans—prosperous, urbanized bearers of a high civilization destined in its later phases to conquer all Europe and then a great part of the earth—rather than to recognize them for what they doubtless were–nomadic, barbarous looters and cattle-reivers whose fate it was through the centuries to disrupt older civilizations but to be civilized by them.

Rather than the image of Indo-European “invaders” whose civilized power subjugated indigenous Indian populations, Emeneau instead imagines barbarians at the gates. Certainly, the language of nomads would be more socially susceptible to indigenous Dravidian, but how does this picture fit with the recent discovery of early population admixture? Would indigenous Dravidians have been more likely to breed freely with uncivilized nomads roaming and slowly penetrating the borderlands? Possibly.

Michael Witzel might have a different solution. The oldest Indian text after the Harappan script itself is the Rigveda, a collection of sacred Vedic Sanskrit hymns. Witzel finds in the earliest sections of the Rigveda several hundred lexical items and a few morphological features that are clearly not of Sanskrit (and therefore not of Indo-European) origin. His analysis of these features leads him to believe that the language spoken before the arrival of Indo-Europeans—i.e., spoken in the Harappan civilization—was more closely related to the Munda languages and the Austroasiatic language family. In other words, Witzel’s analysis suggests that an Indo-European “invasion” and domination of indigenous Dravidian speakers is probably not an accurate historical picture. A sacred Indo-European text like the Rigveda would not contain so many non-IE loanwords if its speakers had entered the scene as dominant bringers of hierarchy. And given that the non-IE loanwords and morphological features are more likely Austroasiatic than Dravidian, Witzel envisions a time when Indo-European speakers and Dravidian speakers immigrated slowly into Harappan civilization, neither dominant invaders nor barbarous raiders. This would explain the cross-linguistic influence in the Indian subcontinent. It would also explain Moorjani et al.’s recent paper showing major mixture between groups in India prior to the rise of the caste system several thousand years ago.

Or maybe not. Witzel’s theory is not well accepted among historical linguists. And if Indo-Aryan and Dravidian immigration was so gradual and perhaps even egalitarian (Witzel imagines that Harappan urban centers may have been trilingual), from whence came a caste system that so clearly favors one ancestral group over the others? And there’s a nagging question about timing: one study suggests that Reich’s ANI might not fit within the purported timeline of Indo-European speakers’ migration. There’s also the issue of linguistic distribution. Razib Khan notes:

It seems an almost default position by many that the Austro-Asiatics are the most ancient South Asians, marginalized by Dravidians, and later Indo-Europeans. I would not be surprised if it was actually first Dravidians, then Austro-Asiatics and finally Indo-Europeans. Dravidians are found in every corner of the subcontinent (Brahui in Pakistan, a few groups in Bengal, and scattered through the center) while the Austro-Asiatics exhibit a more restricted northeastern range.

It’s all quite messy, but my point is that linguists interested in language contact and linguistic evolution should be reading work in population genetics, too. Papers on population genetics often reference work in historical linguistics; however, I rarely see historical linguists citing population genetics.

“Re-purposing Data” in the Digital Humanities

Histories of science and technology provide many examples of accidental discovery. Researchers go looking for one thing and find another. Or, more often, they look for one thing, find something else but don’t realize it until someone points it out in a completely different context. The serendipitous “Eureka!” is the most exciting of all.

Take the microwave oven. Its inventor, Percy Spencer, was not trying to discover a quick, flameless way to cook food. He was working on a magnetron, a vacuum tube designed to produce electromagnetic waves for short-wave radar. One day, he came to work with a chocolate bar in his pocket. The microwaves melted the candy bar. Intrigued, Spencer tried to pop popcorn with the magnetron. That worked, too. So Spencer constructed a metal box, then fed microwaves and food into it. Voilà. A radar tech discovers that a property of the magnetron can be repurposed, from creating short wavelengths for radar to creating hot dogs in 30 seconds.

Another example is the discovery of cosmic microwave background radiation, the defining piece of evidence in support of the Big Bang Theory. Wikipedia tells the story well:

By the middle of the 20th century, cosmologists had developed two different theories to explain the creation of the universe. Some supported the steady-state theory, which states that the universe has always existed and will continue to survive without noticeable change. Others believed in the Big Bang theory, which states that the universe was created in a massive explosion-like event billions of years ago (later to be determined as 13.8 billion).

Working at Bell Labs in Holmdel, New Jersey, in 1964, Arno Penzias and Robert Wilson were experimenting with a supersensitive, 6 meter (20 ft) horn antenna originally built to detect radio waves bounced off Echo balloon satellites. To measure these faint radio waves, they had to eliminate all recognizable interference from their receiver. They removed the effects of radar and radio broadcasting, and suppressed interference from the heat in the receiver itself by cooling it with liquid helium to −269 °C, only 4 K above absolute zero.

When Penzias and Wilson reduced their data they found a low, steady, mysterious noise that persisted in their receiver. This residual noise was 100 times more intense than they had expected, was evenly spread over the sky, and was present day and night. They were certain that the radiation they detected on a wavelength of 7.35 centimeters did not come from the Earth, the Sun, or our galaxy. After thoroughly checking their equipment, removing some pigeons nesting in the antenna and cleaning out the accumulated droppings, the noise remained. Both concluded that this noise was coming from outside our own galaxy—although they were not aware of any radio source that would account for it.

At that same time, Robert H. Dicke, Jim Peebles, and David Wilkinson, astrophysicists at Princeton University just 60 km (37 mi) away, were preparing to search for microwave radiation in this region of the spectrum. Dicke and his colleagues reasoned that the Big Bang must have scattered not only the matter that condensed into galaxies but also must have released a tremendous blast of radiation. With the proper instrumentation, this radiation should be detectable, albeit as microwaves, due to a massive redshift.

When a friend (Bernard F. Burke, Prof. of Physics at MIT) told Penzias about a preprint paper he had seen by Jim Peebles on the possibility of finding radiation left over from an explosion that filled the universe at the beginning of its existence, Penzias and Wilson began to realize the significance of their discovery. The characteristics of the radiation detected by Penzias and Wilson fit exactly the radiation predicted by Robert H. Dicke and his colleagues at Princeton University. Penzias called Dicke at Princeton, who immediately sent him a copy of the still-unpublished Peebles paper. Penzias read the paper and called Dicke again and invited him to Bell Labs to look at the Horn Antenna and listen to the background noise. Robert Dicke, P. J. E. Peebles, P. G. Roll and D. T. Wilkinson interpreted this radiation as a signature of the Big Bang.

Penzias and Wilson were looking for one thing for Bell Labs, found something else, thought it might have been pigeon shit, then realized they’d stumbled upon evidence directly relevant to another research project.

In the sciences, data are data, and once presented, they are there for the taking. “Repurposing data” means using data compiled for one project for your own project. In some sense, all scholars do this. Bibliographies and lit reviews signal that a piece of scholarship has built on existing scholarship. In the humanities, however, scholars are accustomed to building on whole arguments, not individual points of data. If Dicke, Peebles, and Wilkinson had been humanists, they would have asked, “How does the practice of detecting faint radio waves bounced off Echo balloon satellites relate to our work on cosmic background radiation?” Which is not necessarily the wrong question to ask (the connection might have been forged eventually), but given that everyone involved was a scientist, no one posed the question that way, and I imagine it was much more natural for Penzias and Wilson’s data to be removed from one context and placed into another. Humanists, on the other hand, are not conditioned to chop up another scholar’s argument, isolate a detail, remove it, and put it into an unrelated argument. This seems like bad form. Sources, their contexts, the nuances of their arguments are introduced in total—this is vital if you are going to use a source properly in the humanities.

Digital humanists construct arguments just like any other humanists, but rather than deploying what Rebecca Moore Howard calls “ethos-based” argumentation, they typically traffic in mined and researched data—the locations of beginnings and endings in Jane Austen novels; citation counts in academic journals; metadata relating to the genders and nationalities of authors. These data always exist in the context of a specific argument made by the researcher who has compiled them, but data are more portable than ethos-based arguments, in which any one strand of thought relies on all the others. No such reliance exists, however, in data-based argumentation. In other words, an antimetabole: a data-based argument relies on the data, but the data do not rely on the argument.

A hypothetical example and a real one:

In “Style, Inc: Reflections on 7,000 Titles,” Moretti compiles a very particular set of data: the word counts of British novel titles between 1740 and 1850. He provides several graphs to document an obvious trend, that novel titles got drastically shorter throughout the 18th and 19th centuries. From these data, Moretti makes, as he usually does, a compelling argument about the literary marketplace and its effect on literary form:

As the number of new novels kept increasing, each of them had inevitably a much smaller ‘window’ of visibility on the market, and it became vital for a title to catch quickly and effectively the eye of the public. [Summary titles] were not good at that. They were good at describing a book in isolation: but when it came to standing out in a crowded marketplace, short titles were better—much easier to remember, to begin with. (187-88)

Moretti’s argument relies on his analysis of data about novel titles; his argument would be weaker (non-existent?) without the data. But now that these data have been compiled, are they useful only in the context of Moretti’s argument? Of course not. Let’s say I’m a book historian writing my dissertation on changing book and paper sizes between 1500 and 1900. Let’s say I’ve discovered (hypothetically—it’s probably not true) that smaller book sizes—duodecimos and even sextodecimos—proliferated between 1810 and 1900, relative to earlier decades in the 18th century. Now let’s say I find Moretti’s article on shortened book titles during the same period. Hmm, I think. Interesting. Never mind that “Style, Inc.” is focused on literary form, never mind that I’m writing about the materials of book history, never mind that I’m not interested in Moretti’s argument about literary form per se. Moretti’s data nevertheless might generate an interesting discussion. Maybe I’ll look at titles more closely. Maybe I can even get a whole chapter out of this—“Titles and Title Pages in relation to Book Sizes.” A serendipitous connection. A scholar in book history and a literary scholar making different but in no way opposed arguments from the same data.

Real example: I’ve just finished a paper on the construction of disciplinary boundaries in academic journals. In it, I use data from Derek Mueller’s article which counts citations in the journal College Composition and Communication. I also compile citations from other journals, focusing on citations in abstracts. But the argument I make is not quite the same as Mueller’s. In fact, I analyze my data on citations in a way that hopefully shines a new light on Mueller’s data. Both Mueller and I discover (unsurprisingly) that citations in articles and abstracts form a power law distribution. Mueller argues that the “long tail” of the citation distribution implies a “loose amalgamation” of disparate scholarly interests and that the head of the distribution represents the small canon uniting the otherwise disparate interests. I argue, however, that when we look at the entire distribution thematically, we discover that each unique citation added to the distribution—whether it ends up in the head or the long tail—may in fact be thematically connected to many other citations, whether they also be in the head or the long tail. (For example, Plato is in the head of one journal’s citation distribution, and Aristophanes is in the long tail, but a scholar’s addition of Aristophanes to the long tail does not imply scholarly divergence from the many additions of Plato. Both citations suggest unity insofar as both signal a single scholarly focus on rhetorical history.)

I re-purpose Mueller’s data but not his argument. Honestly, in my paper, I don’t spend much time at all working through the nuances of Mueller’s paper because they’re not important to mine. His data are important—they and the methods he used to compile them are the focus of my argument, which moves in a slightly different direction than Mueller’s.

To reiterate: data in the digital humanities beg to be re-purposed, taken from one context and transferred to another. All arguments rely on data, but the same data may always be useful to another argument. At the end of my paper, I write: “I have used these corpora of article abstracts to analyze disciplinary identity, but this same group of texts can be mined with other (or the same) methods to approach other research questions.” That’s the point. Are digital humanists doing this? They certainly re-purpose and evoke one another’s methods, but to date, I have not seen any papers citing, for example, Moretti’s actual maps to generate an argument not about methods but about what the maps might mean. Just because Moretti generated these geographical data does not mean he has sole ownership over their implications or their usefulness in other contexts.

There’s a limit to all this, of course. Pop-science journalism, at its worst, demonstrates the hazards of decontextualizing a data-point from a larger study and drawing all sorts of wild conclusions from it, conclusions contradicted by the context and methods of the study from which the data-point was taken. It is still necessary to analyze critically the research from which data are taken and, more importantly, the methods used to obtain them. However, if we are confident that the methods were sound and that our own argument does not contradict or over-simplify something in the original research, we can be equally confident in re-purposing the data for our own ends.

Graphing Citations and Making Sense of Disciplinary Divisions

A Pareto distribution is the troubling result of Derek Mueller’s distant reading of citations in College Composition and Communication: a “long tail” of citations, a handful of names cited many times but exponentially more names cited only once. Out of 8,035 unique citations, 5,761 were cited once and 986 were cited twice. In other words, 84% of the unique sources cited in CCC were cited only once or twice over a 25-year period.

Troubling, but unsurprising. Physical and social scientists have long known that power law distributions occur across a wide variety of phenomena, including academic citations (Gupta et al. 2005). That a long tail occurs in a rhet/comp journal simply puts our discipline in the same position as everyone else: a small group of scholarly work has gained a “cumulative advantage” or “preferential attachment” and thus become the core set of classic texts recognized by the field, while most other scholars fail to produce texts that cross the tipping point toward their own preferential attachment. It is usually assumed that this core group of scholars is what unites a discipline. To some extent, the assumption is probably true. However, Mueller is right to ask how far a citation trail can lead away from that core group of scholars before we start questioning just how unified a discipline really is.
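(To see how mechanically such a shape can arise, here is a toy simulation of “cumulative advantage,” my own sketch with invented parameters: each new citation either goes to a brand-new source or to an already-cited source in proportion to its current count. A long tail of once-cited sources falls out almost immediately.)

```python
# Toy preferential-attachment simulation. The probabilities and totals are
# invented; the point is only that a long tail emerges from the process.
import random

random.seed(0)
counts = [1]      # start with a single source cited once
p_new = 0.4       # chance a citation introduces a never-before-cited source

for _ in range(8000):
    if random.random() < p_new:
        counts.append(1)                 # new source joins the tail
    else:                                # otherwise, the rich get richer
        i = random.choices(range(len(counts)), weights=counts)[0]
        counts[i] += 1

once_or_twice = sum(1 for c in counts if c <= 2)
print(f"{len(counts)} unique sources; "
      f"{once_or_twice / len(counts):.0%} cited only once or twice")
```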

When graphing citation counts, it’s not problematic to discover a steep drop between the most cited scholar and the tenth most cited scholar; nor is it problematic that most sources are cited infrequently. The problem is not the long tail. The problem, in CCC’s case, is that the long tail very rapidly approaches a value equal to one. This indicates that any given source in CCC is valuable to the scholar citing it but effectively worthless to everybody else who publishes in the journal. If most citations occurred three, four, five times, even that would suggest a certain unity of purpose—what one scholar has found valuable, several others have found valuable as well, in various issues and various contexts. But when the long tail is mostly comprised of sources cited once and never again? That requires a more robust explanation than a nod toward a core group of scholars can provide. Mueller thus raises the right question:

Although we do not at this time have data from all of the major journals to investigate this fully, the changing shape of the graphed distribution reiterates more emphatically a question only hinted at . . . but one nevertheless crucial to the idea of a common disciplinary domain: How flat can the citation distribution become before it is no longer plausible to speak of a discipline?

To answer Mueller’s call for more data, I have compiled article abstracts from CCC and two other major journals in the field—Rhetoric Society Quarterly and Rhetoric Review. I intend this post to serve as a tentative response to the question posed by Mueller at the end of this quote.  The CCC abstracts run from February 2000 (51.3) to September 2011 (63.1), a total of 261 abstracts. The RSQ abstracts run from Winter 2000 (30.1) to Fall 2011 (41.5), a total of 220 abstracts. The RR abstracts run from 2002 (21.3) to 2011 (30.4), a total of 154 abstracts.

Only abstracts, not full articles. However, because only the most important citations appear in abstracts, I think tallying abstract citations offers the best chance to shorten the long tail and partially alleviate the implications of Mueller’s work. It is not a slight to the humanities to point out that articles demand more citations than their arguments actually require: many article citations can be removed without affecting anything vital to an argument. Citations in abstracts, on the other hand, are in most cases central to the argument or study undertaken. If we count only the most important sources in each journal—the ones that surface in abstracts—is the long tail of citation distributions less pronounced? We can expect to discover a long tail. That’s a mathematical inevitability. But if a journal—to say nothing of an entire discipline—is somehow unified, citations in abstracts should have a slightly less extreme power law distribution than citations in the articles themselves. Abstract citations are the “cream of the crop,” those vital enough to make it into the space constraints of the abstract genre: we hope to find fewer citations and therefore a graph that does not drop so precipitously toward y = 1.

Methods: Each corpus was loaded into the Natural Language Toolkit (NLTK) and tagged for part of speech. Then I compiled the proper nouns. The proper noun list was larger than, but included, proper names. I extracted these names—noun forms (e.g., ‘Burke’ or ‘Burke’s’) and adjective forms (e.g., ‘Burkean’)—and tracked them across the abstracts. I recorded each unique citation as well as the number of abstracts in which it was cited.
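Here is a rough sketch of that pipeline with NLTK, using toy abstracts and skipping the extra handling of possessive and adjectival forms (‘Burke’s’, ‘Burkean’) that the real tally required:

```python
# Sketch of the method: POS-tag each abstract, keep proper nouns, and tally
# how many abstracts mention each name. Toy data, simplified matching.
from collections import Counter
import nltk
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")  # first run only

abstracts = [
    "This essay reads Kenneth Burke alongside John Dewey.",
    "A response to Peter Elbow, with a long detour through Burke.",
]

name_counts = Counter()
for abstract in abstracts:
    tagged = nltk.pos_tag(nltk.word_tokenize(abstract))
    names = {word for word, tag in tagged if tag in ("NNP", "NNPS")}
    name_counts.update(names)   # each name counted once per abstract

print(name_counts.most_common())   # e.g. [('Burke', 2), ('Kenneth', 1), ...]
```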

[Image: Finding citation names]

Here are spreadsheets with the unique citations and their citation counts in each abstracts corpus: College Composition and Communication. Rhetoric Society Quarterly. Rhetoric Review.

There are 79 unique citations in the CCC abstracts; 159 unique citations in the RSQ abstracts; and 121 unique citations in the RR abstracts. Only six citations occur in both the RSQ and CCC abstracts corpora: Mina Shaughnessy, Kenneth Burke, John Dewey, Donald Davidson, Peter Elbow, and Mikhail Bakhtin. When factoring in RR, only Kenneth Burke, John Dewey, and Peter Elbow are shared across all three corpora. RR and RSQ share quite a few sources, almost all of which are historical figures—Plato, Aristotle, Cicero, Isocrates, and the like. Kenneth Burke is the most frequently cited source in each abstracts corpus: he is cited in 5 separate abstracts in CCC, 17 in RSQ, and 14 in RR. Maybe “rhetoric and composition” should be changed to “Burkean studies.” No surprise—the man has his own journal.

Based on the raw count of unique citations in each journal—on average, less than one per abstract—I think my original suggestion is at least partially correct: counting citations in abstracts controls for the rhetorical demand of articles to cite more sources than necessary. Abstract citations are the stars of the show. Nevertheless, after graphing the citations, Pareto distributions did emerge:

[Image: CCC abstract citations]

[Image: RSQ abstract citations]

[Image: RR abstract citations]

Citations in the CCC abstracts occurred in a slightly more even distribution than citations in CCC articles (cf. Mueller). But then, there aren’t many citations in this corpus, relative to the RSQ and RR corpora. Among the citations that do appear, none occur in numbers much greater than those occurring in only one abstract. The citation occurring most frequently—Burke—occurs in five abstracts. Does this graph confirm Mueller’s conclusion about a dappled CCC? To some extent, yes. There’s still a long tail, after all . . .

RSQ citations even more obviously display the Pareto distribution discussed in Mueller’s article. The citations occurring most frequently—Burke and Plato—surface in 17 and 14 abstracts, respectively.

The distribution in RR is also uneven, and the drop of the long tail is even more precipitous than the one in RSQ. Burke is cited in 14 abstracts and the next most frequent source, Aristotle, is cited in 5 abstracts.

These graphs indicate that even in article abstracts—where only the most vital sources are invoked—a small canon of core scholars emerges beside an otherwise long, flat, dappled distribution of citations. More divergence and specialization, then—not just in CCC but in RR and RSQ.

I think there’s more to it than disciplinary divergence, however. These long tails can undoubtedly be explained mathematically (the conclusion being that they’re inevitable), but in this particular case they might also be explained in prosaic terms. And I believe this prosaic explanation makes sense of the long tail in a way that salvages a shred of disciplinary unity within each journal:

In RR and RSQ, for example, an obvious citation pattern emerges. Five of the ten most cited sources in the RSQ abstracts are historical figures: Plato, Aristotle, Quintilian, Blair, and Cicero. In RR, the exact same thing: Aristotle, Cicero, Isocrates, Plato, Quintilian. But glancing through the long tail of both citation counts, historical figures continue to emerge, mostly from the Greco-Roman world but from beyond it as well. In the CCC long tail, on the other hand, historical figures appear less frequently, and only two date from before the 19th century.

Raw numbers for RR and RSQ: 27 (or 22%) of the RR citations are sources from the 17th century or earlier. 26 (or 16%) of RSQ citations are from the same period. Most are Greco-Roman sources, but Confucius, Montaigne, and Averroes are also scattered throughout the long tail. We might conclude, then, that a decently sized community of historians of rhetoric communicates in RSQ and RR (when they’re not communicating in Rhetorica, presumably). Their communication adds to the long tail, but does it signify disciplinary divergence and specialization?

Rather, here is one disciplinary community—historians of rhetoric—mapped out in unity. Its borders extend slightly into CCC but its principal territory lies in RSQ and RR. An obvious outcome, if you’re involved in the field. However, it also helps us make partial sense of that worrying Pareto distribution: not all of the singular citations that constitute the long tail are as disconnected as the graphs lead us to believe. In RSQ and RR, many singular citations could be grouped together: Plutarch, Laertius, Strabo, Aristophanes—these are, at least, not as indicative of a dappled disciplinary identity as, say, St. Paul and Steven Mailloux.

The same point can be made with pedagogy in the CCC abstracts. It is not surprising, of course, that CCC is home to scholars citing pedagogically inclined sources; however, for a second time, this obvious point helps make sense of the Pareto distribution of citations presented here and in Mueller’s article: Charles Peirce, Mina Shaughnessy, Melvin Tolson, Les Perelman—each appears only once, scattered throughout the long tail of abstract citations. But each is invoked for his or her direct relevance to writing pedagogy. Viewed in this way, the flat distribution of citations seems a little less dappled.

Grammatical Anaphors without C-command

More on Chomsky’s Binding Theory. It’s a good example of how generative rules are constantly formulated and re-formulated in light of new evidence—languages are infinite, there’s always new evidence—a seemingly endless process that to my mind undermines the entire concept of Universal Grammar (though not the fact of linguistic structure).

To undermine Binding Theory in particular, here’s a piece of evidence that complicates Binding Principle A. Of course, many linguists have presented reams of evidence to complicate Principle A as traditionally construed, but I’ve never seen this particular data-point, which, I think, complicates not only Principle A but also the centrality of c-command to anaphor distribution, which is what Principle A is supposed to account for.

Principle A states that one copy of a reflexive in a chain must be bound within the smallest CP or DP containing it and a potential antecedent. A reflexive is bound if it is co-indexed with and c-commanded by its antecedent Determiner Phrase (DP). Co-indexation simply means that both DPs refer to the same entity (e.g., John and himself). C-command is a structural relation. In a syntax tree, a node c-commands its sister node and all the nodes dominated by its sister. In practical terms, a phrase in English will usually but not always c-command all the other words and phrases to the right of it (e.g., all the words spoken after the phrase):

CCommand

According to Principle A, a reflexive pronoun (also called an anaphor in generative linguistics) must be bound in its domain. It must be co-indexed with and c-commanded by another DP:

CCommandBinding

In the sentence The girl loves herself, the anaphor is co-indexed with and c-commanded by its antecedent DP. Thus, the sentence is grammatical. The anaphor cannot refer to anyone but the girl. If you wanted the pronoun to refer to someone or something other than the girl—that is, if you gave it a different index—then you would need to change the anaphor to a personal pronoun, it or her, to make the sentence grammatical: The girl loves it.

The sentence *Herself loves the girl is ungrammatical, according to Principle A, because the relation is reversed: herself c-commands the girl, but the anaphor is supposed to be the one that gets c-commanded. It isn’t, so the sentence doesn’t work.
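Because c-command is a purely configurational relation, it is easy to state as a check over a toy tree. Here is a minimal Python sketch of my own (nothing from the linguistics literature): a node c-commands its sister and everything its sister dominates, and the index argument is a stand-in for co-indexation.

# Toy syntax tree. A node c-commands its sister and everything the sister dominates.
class Node:
    def __init__(self, label, children=None, index=None):
        self.label = label            # e.g. "DP: the girl"
        self.index = index            # referential index, for co-indexation
        self.children = children or []
        self.parent = None
        for child in self.children:
            child.parent = self

    def dominates(self, other):
        # True if `other` sits anywhere beneath this node.
        return any(c is other or c.dominates(other) for c in self.children)

def c_commands(a, b):
    # A c-commands B if one of A's sisters is B or dominates B.
    if a.parent is None:
        return False
    return any(s is b or s.dominates(b)
               for s in a.parent.children if s is not a)

# "The girl loves herself": [TP [DP the girl] [VP loves [DP herself]]]
herself = Node("DP: herself", index=1)
vp = Node("VP", [Node("V: loves"), herself])
girl = Node("DP: the girl", index=1)
tp = Node("TP", [girl, vp])

print(c_commands(girl, herself))   # True  -> the co-indexed anaphor is c-commanded, hence bound
print(c_commands(herself, girl))   # False -> the object does not c-command the subject

# "*Herself loves the girl": swap the positions and the anaphor has no binder.
herself2 = Node("DP: herself", index=1)
girl2 = Node("DP: the girl", index=1)
tp2 = Node("TP", [herself2, Node("VP", [Node("V: loves"), girl2])])
print(c_commands(girl2, herself2))  # False -> nothing binds the anaphor; Principle A violated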

The notion of c-command is a vital component of nearly all theories of pronoun and anaphor distribution, even the ones that have completely overhauled Chomsky’s original Binding Principles. But look at the grammatical examples in (1) and (2) below:

(1) There was a man in an attic searching through an old photo album. Surprisingly, the man’s search turned up images of himself and not his son, like he had expected.

(2) The photographer thought his lab was developing pictures of his girlfriend. Surprisingly, the photographer’s lab developed pictures of both his girlfriend and himself.

The man’s search and The photographer’s lab are possessor DPs. They have the following structure:

PossesserDP1

With possessor DPs, the possessor is actually a second DP embedded within the DP that expresses the possessor-possessee relationship. In other words, the photographer is embedded lower in the tree than the photographer’s lab. I said a moment ago that a phrase in English will usually but not always c-command all the words and phrases to the right of it. The two examples above fall under “but not always”:

PossessorDPCcommand

In (2), the photographer only c-commands lab; it is embedded too deep to c-command anything else. In (1), the man c-commands search; it is embedded too deep to c-command anything else. Neither DP c-commands into the Verb Phrase, which means that neither DP c-commands the anaphor embedded within the Verb Phrase. The anaphors in (1) and (2) are not c-commanded and thus not bound. This should trigger a Principle A violation, but according to my judgment and the judgment of several informants, (1) and (2) sound just fine.
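Running example (2) through the toy helpers sketched above makes the gap concrete (again, a simplified illustrative tree, not a full parse): the whole possessor DP c-commands the anaphor, while the embedded possessor, the only DP co-indexed with the anaphor, does not.

# Example (2): [TP [DP [DP the photographer] ['s lab]] [VP developed [DP ... himself]]]
himself = Node("DP: pictures of ... himself", index=2)
vp = Node("VP", [Node("V: developed"), himself])
photographer = Node("DP: the photographer", index=2)
poss_dp = Node("DP", [photographer, Node("D': 's lab")])
tp = Node("TP", [poss_dp, vp])

print(c_commands(poss_dp, himself))       # True  -> "the photographer's lab" c-commands the anaphor
print(c_commands(photographer, himself))  # False -> the co-indexed possessor does not, yet (2) sounds fine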

If anaphor distribution truly relied on c-command, then (1) and (2) above should sound just as awful as *Herself loves the girl.

I said at the beginning that Chomsky’s Binding Theory has been called into question for many years now, but as far as I know, most attempts to re-theorize it continue to rely on c-command as an important structural element for describing constraints on anaphor distribution. However, the data presented here demonstrate that anaphors can still sound grammatical even when they are not c-commanded. This indicates that discursive contexts can override the constraints of c-command on anaphor distribution.

Binding Reflexives and Herding Cats

Chomsky’s insight is that language possesses structure independent of meaning. Take the examples below:

(1a) There seems to be a girl in the garden

(1b) ??There seems to be Kate in the garden

(1c) ??There seems to be the boy in the garden

(1d) *There seems to be him in the garden

The only difference between these sentences is the noun phrase that appears before in the garden—a girl, Kate, the boy, and him. So why does (1a) sound perfectly fine while the others sound off? Why does (1d) sound thoroughly ungrammatical? There must be structural elements involved here that are not visible in the words themselves.

Another famous example:

(2a) Colorless green ideas sleep furiously

(2b) *Colorless ideas green furiously sleep

(2c) *Colorless green ideas sleeps furiously

Each sentence is meaningless. Yet most English speakers will agree that (2a) is fine while (2b) is word salad, and that in (2c), there’s something wrong with the verb. Again, the only reason why a meaningless sentence can still sound wrong or right is that the structure of language is at least partially independent of its meaning. From this hypothesis follows the concept of universal grammar—all human groups exhibit language, and if languages exhibit structure independent of meaning, then at a deep level, all human languages, beneath their superficial diversity, might operate upon the same structures. The goal of “Chomskyan” or “formalist” linguistic analysis is to describe the structure of this universal grammar (UG).

An adequate structural model of a language (and, eventually, of all languages) will consist of rules that can generate the grammatical sentences in the language while at the same time barring ungrammatical sentences from being generated. For the last several decades, the work of linguistics in North America and much of Europe has centered around discovering and describing these generative rules. The problem is that when one scholar gets a rule just right (it correctly predicts which sentences will be grammatical and which ones will be filtered out as ungrammatical), some other scholar pops up with new data showing a grammatical or ungrammatical sentence that shouldn’t exist according to the rule. And so the rule gets re-worked, made more complex, or abandoned in favor of some other rule . . . which awaits its destruction at the hands of some bizarre sentence that should or should not be grammatical.

It’s obvious that languages have structure. What’s not so obvious is that linguistic structure can be described with a closed system of rules. In the humble opinion of this blogger, trying to model UG is like trying to herd cats. Maybe you can herd most of them, but there are always a few that just hiss and run away, and their existence seems to undermine the premise of the whole endeavor.

Take reflexive pronouns, for example. If any linguistic element can be described with robustly predictive rules, it should be reflexives. By definition, reflexives are structural: they must refer to (i.e., be co-indexed with) some other noun phrase (NP) in a sentence; otherwise, they sound ungrammatical, as in *Himself went to the store.

It has long been noted that reflexive pronouns in English and many other languages appear in complementary distribution with personal pronouns, which don’t need to co-refer with another noun phrase in a sentence:

(3a) Michael loves himself

(3b) Michael loves him

In (3a), himself can only refer to Michael. In (3b), him cannot refer to Michael; it must refer to some NP other than Michael, an NP which needn’t exist in the same sentence. If you want him to refer to Michael, you don’t use him; you use the reflexive himself.

This distribution of reflexives and personal pronouns is the basis of Chomsky’s Binding Theory, specifically Binding Principles A and B, which state, respectively, that reflexives must be c-commanded by their co-indexed NP within some local domain and that pronouns cannot be c-commanded by their co-indexed NP within some local domain. Defining “domain” is tricky. Once upon a time, it appeared that the domain was the clause:

(4) Michael said that he loves Mary

In (4), the pronoun he is indeed c-commanded by its co-indexed NP, Michael, but the sentence is still grammatical. Apparently, Binding Principle B only applies intra-clausally. The “domain” for the binding principles must therefore be the clause.

Binding Principle A: A reflexive pronoun must be c-commanded by its co-indexed NP within the clause that immediately contains both the reflexive and its antecedent.

Binding Principle B: A pronoun must not be c-commanded by a co-indexed NP within the same clause.

An NP that is both c-commanded by and co-indexed with another NP is said to be “bound” by the second NP, its antecedent. Binding Principles A and B can be glossed in simpler terms by saying that a reflexive pronoun must be bound within its clause, and a personal pronoun must not be bound (or “must be free”) within its clause. As formulated, these rules correctly predict the grammaticality of many, many sentences cross-linguistically.
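Reusing the toy Node and c_commands helpers sketched earlier in this document, the clause-bound gloss can be written down directly. This is my own rough encoding, with referential indices marking co-reference and clause nodes labeled TP or CP:

def smallest_clause(node):
    # Walk upward to the nearest clause node (TP or CP).
    n = node.parent
    while n is not None and not n.label.startswith(("TP", "CP")):
        n = n.parent
    return n

def satisfies_principle_a(reflexive, nps):
    # Clause version of Principle A: some co-indexed NP c-commands the
    # reflexive from inside the reflexive's smallest clause.
    clause = smallest_clause(reflexive)
    return any(np is not reflexive and np.index == reflexive.index
               and c_commands(np, reflexive) and clause.dominates(np)
               for np in nps)

def satisfies_principle_b(pronoun, nps):
    # Clause version of Principle B: no co-indexed NP c-commands the
    # pronoun from inside the pronoun's smallest clause.
    clause = smallest_clause(pronoun)
    return not any(np is not pronoun and np.index == pronoun.index
                   and c_commands(np, pronoun) and clause.dominates(np)
                   for np in nps)

# (4) "Michael said that he loves Mary": he and Michael share index 2.
he = Node("DP: he", index=2)
emb_tp = Node("TP", [he, Node("VP: loves Mary")])
michael = Node("DP: Michael", index=2)
tp = Node("TP", [michael, Node("VP", [Node("V: said"), Node("CP: that", [emb_tp])])])
print(satisfies_principle_b(he, [michael, he]))  # True -> he is free in its own clause, so (4) is fine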

But not all of them:

(5) Michael loves his snake

The pronoun his is bound by Michael within the same clause. That’s a violation of Principle B. (5) should not be grammatical. But it’s grammatical. Something’s wrong with Principle B. And what about the example in (6):

(6) Mary thinks that the picture of herself looks beautiful

The reflexive herself is in a separate clause from its binding NP, Mary. That’s a violation of Principle A. (6) should not be grammatical. But it is. Something’s wrong with Principle A, too.
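Running (5) and (6) through that toy checker reproduces the mismatch (same caveats as before about the simplified trees):

# (5) "Michael loves his snake": his and Michael share index 3.
his = Node("DP: his", index=3)
vp5 = Node("VP", [Node("V: loves"), Node("DP", [his, Node("NP: snake")])])
michael5 = Node("DP: Michael", index=3)
tp5 = Node("TP", [michael5, vp5])
print(satisfies_principle_b(his, [michael5, his]))        # False -> predicted bad, but (5) is fine

# (6) "Mary thinks that the picture of herself looks beautiful": index 4.
herself6 = Node("DP: herself", index=4)
pic = Node("DP", [Node("N: the picture of"), herself6])
emb_tp6 = Node("TP", [pic, Node("VP: looks beautiful")])
mary = Node("DP: Mary", index=4)
tp6 = Node("TP", [mary, Node("VP", [Node("V: thinks"), Node("CP: that", [emb_tp6])])])
print(satisfies_principle_a(herself6, [mary, herself6]))  # False -> predicted bad, but (6) is fine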

Chomsky and others tried to tighten up the binding rules to account for these sentences by changing the definition of “domain.” I won’t go into all the details, but at the moment, standard linguistics textbooks describe the binding rules in the following way (these definitions come from Carnie):

Binding Principle A: One copy of a reflexive in a chain must be bound within the smallest CP or DP containing it and a potential antecedent.

Binding Principle B: A pronoun must be free (not bound) within the smallest CP or DP containing it but not containing a potential antecedent. If no such category is found, the pronoun must be free within the root CP.

Clearly, the only way to salvage the entire premise of the binding principles is to make them quite a bit more complicated. That’s not necessarily a mark against the theory. No one said linguistic structure would be simple or elegant.

However, these new and improved binding rules continue to rely on the notion that reflexives and pronouns will be bound or not bound within their domains. They also continue to predict that reflexives and pronouns will be in complementary distribution.

Damn:

(7) Grand ideas about himself occupy John all day

(8a) John boasted that the Queen had invited Lucie and himself for tea

(8b) John boasted that the Queen had invited Lucie and him for tea

(8a) and (8b) demonstrate that pronouns and reflexives, in this case, are not in complementary distribution. (7) provides an example of a reflexive that is not bound by its co-indexed NP—himself occurs before John. It looks like even our new and improved (and more complex) binding rules fail to predict which sentences will or will not be grammatical. These examples could easily be multiplied. And we haven’t even left English!

Of course, linguists continue to re-formulate binding rules that take the above examples into consideration. But in order to herd these cats, things get very complicated very quickly, and many of the papers formulating new binding rules (e.g., Reinhart and Reuland 1993) contain a lot of sentences that begin with “Suppose that . . .” The suppositions may indeed be correct, and, as I said, there was never a guarantee that the rules of UG would be simple. However, for the past 40 years, North American linguistics has been a constant complication of older rules with newer rules as more data (especially cross-linguistic data) comes to the field’s attention. This process of formulation and re-formulation in light of new data, which I have simplistically illustrated here with the Binding Principles, is exactly what linguists do. This process may indeed be expanding our knowledge about the structures of languages and UG. I think it has provided a lot of insight into linguistic structures. But it seems like there can never be closure. There will always be another piece of data to demonstrate that a rule is incomplete or simply incorrect. And unfortunately, the apparent impossibility of ever reaching closure on the rules being amassed seems to undermine the entire process. One can’t help agreeing, if only momentarily, with John McWhorter’s warning that the search for the structures of Universal Grammar might look as silly to future scholars as the search for phlogiston looks to us today.

(Or it might not. I don’t know. In the end, the argument I made in the paragraph above is similar to the argument against trying to pin down polygenic traits in humans—it’s just too complicated. And that’s never a productive stance to take.)