Graphing Citations and Making Sense of Disciplinary Divisions

A Pareto distribution: that is the troubling result of Derek Mueller’s distant reading of citations in College Composition and Communication. His counts reveal a “long tail” of citations: a handful of names cited many times, and far more names cited only once. Out of 8,035 unique citations, 5,761 were cited once and 986 were cited twice. In other words, 84% of unique citations in CCC occurred only once or twice in a 25-year period.
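The 84% figure can be checked directly from the counts quoted above. This is only a quick arithmetic sketch; the variable names are mine, not Mueller’s.

```python
# Counts as reported above from Mueller's study of CCC citations.
unique_citations = 8035
cited_once = 5761
cited_twice = 986

# Share of unique citations that appear only once or twice.
share_once_or_twice = (cited_once + cited_twice) / unique_citations

# Share of unique citations that appear exactly once.
share_once = cited_once / unique_citations

print(f"cited once or twice: {share_once_or_twice:.1%}")
print(f"cited exactly once:  {share_once:.1%}")
```

Rounded to whole percentages, the first share matches the 84% stated in the text, and sources cited exactly once account for a little over seven in ten unique citations.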

Troubling, but unsurprising. Physical and social scientists have long known that power law distributions occur across a wide variety of phenomena, including academic citations (Gupta et al. 2005). That a long tail occurs in a rhet/comp journal simply puts our discipline in the same position as everyone else: a small body of scholarly work has gained a “cumulative advantage” or “preferential attachment” and thus become the core set of classic texts recognized by the field, while most other scholars fail to produce texts that cross the tipping point toward their own preferential attachment. It is usually assumed that this core group of scholars is what unites a discipline. To some extent, the assumption is probably true. However, Mueller is right to ask how far a citation trail can lead away from that core group of scholars before we start questioning just how unified a discipline really is.

When graphing citation counts, it’s not problematic to discover a steep drop between the most cited scholar and the tenth most cited scholar; nor is it problematic that most sources are cited infrequently. The problem is not the long tail. The problem, in CCC’s case, is that the long tail very rapidly approaches a value of one. This indicates that any given source in CCC is valuable to the scholar citing it but effectively worthless to everybody else who publishes in the journal. If most citations occurred three, four, or five times, even that would suggest a certain unity of purpose: what one scholar has found valuable, several others have found valuable as well, in various issues and various contexts. But when the long tail is composed mostly of sources cited once and never again? That requires a more robust explanation than a nod toward a core group of scholars can provide. Mueller thus raises the right question:

Although we do not at this time have data from all of the major journals to investigate this fully, the changing shape of the graphed distribution reiterates more emphatically a question only hinted at . . . but one nevertheless crucial to the idea of a common disciplinary domain: How flat can the citation distribution become before it is no longer plausible to speak of a discipline?

To answer Mueller’s call for more data, I have compiled article abstracts from CCC and two other major journals in the field: Rhetoric Society Quarterly (RSQ) and Rhetoric Review (RR). I intend this post to serve as a tentative response to the question Mueller poses at the end of the passage quoted above. The CCC abstracts run from February 2000 (51.3) to September 2011 (63.1), a total of 261 abstracts. The RSQ abstracts run from Winter 2000 (30.1) to Fall 2011 (41.5), a total of 220 abstracts. The RR abstracts run from 2002 (21.3) to 2011 (30.4), a total of 154 abstracts.

Only abstracts, not full articles. However, because only the most important citations appear in abstracts, I think tallying abstract citations offers the best chance to shorten the long tail and partially alleviate the implications of Mueller’s work. It is not a slight to the humanities to point out that articles demand more citations than their arguments actually require: many article citations can be removed without affecting anything vital to an argument. Citations in abstracts, on the other hand, are in most cases central to the argument or study undertaken. If we count only the most important sources in each journal—the ones that surface in abstracts—is the long tail of citation distributions less pronounced? We can expect to discover a long tail. That’s a mathematical inevitability. But if a journal—to say nothing of an entire discipline—is somehow unified, citations in abstracts should have a slightly less extreme power law distribution than citations in the articles themselves. Abstract citations are the “cream of the crop,” those vital enough to make it into the space constraints of the abstract genre: we hope to find fewer citations and therefore a graph that does not drop so precipitously toward x=1.

Methods: Each corpus was loaded into the Natural Language Toolkit (NLTK) and tagged for part of speech. I then compiled the proper nouns. That list was larger than the list of personal names but included them. I extracted the names in both noun forms (e.g. ‘Burke’ or ‘Burke’s’) and adjective forms (e.g. ‘Burkean’) and tracked them across the abstracts. For each unique citation, I recorded the number of abstracts in which it was cited.
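For readers who want to try something similar, here is a minimal sketch of the counting step. It is not the script I actually ran. It assumes each abstract has already been tagged into (word, tag) pairs, the shape that NLTK’s pos_tag returns with Penn Treebank tags. The VARIANTS table, the known_names set, and the toy abstracts are hypothetical stand-ins for the hand-compiled name list.

```python
from collections import Counter

# Hypothetical variant table: adjective forms mapped to one canonical name.
VARIANTS = {"Burkean": "Burke", "Platonic": "Plato", "Aristotelian": "Aristotle"}


def canonical_name(token):
    """Collapse possessive and adjective forms onto a single name."""
    for suffix in ("'s", "’s"):
        if token.endswith(suffix):
            token = token[: -len(suffix)]
    return VARIANTS.get(token, token)


def names_in_abstract(tagged_tokens, known_names):
    """Return the set of known names found in one POS-tagged abstract.

    Adjective forms such as 'Burkean' may come back tagged JJ rather
    than NNP, so both proper-noun and adjective tags are checked.
    """
    found = set()
    for word, tag in tagged_tokens:
        if tag in ("NNP", "NNPS", "JJ"):
            name = canonical_name(word)
            if name in known_names:
                found.add(name)
    return found


def abstract_counts(tagged_abstracts, known_names):
    """Count, for each name, the number of abstracts that mention it."""
    counts = Counter()
    for tagged in tagged_abstracts:
        # A set is used so each name counts at most once per abstract.
        counts.update(names_in_abstract(tagged, known_names))
    return counts


# Toy tagged abstracts, invented for illustration only.
toy_abstracts = [
    [("Burke", "NNP"), ("’s", "POS"), ("pentad", "NN"), ("and", "CC"), ("Plato", "NNP")],
    [("A", "DT"), ("Burkean", "JJ"), ("reading", "NN"), ("of", "IN"), ("Burke", "NNP")],
    [("Plato’s", "NNP"), ("Phaedrus", "NNP")],
]
known_names = {"Burke", "Plato"}
counts = abstract_counts(toy_abstracts, known_names)
print(dict(counts))
```

The design choice that matters is counting abstracts rather than mentions: a name that appears twice in one abstract, as Burke does in the second toy abstract, still counts once for that abstract.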

[Figure: Finding citation names]

Here are spreadsheets with the unique citations and their citation counts in each abstracts corpus: College Composition and Communication. Rhetoric Society Quarterly. Rhetoric Review.

There are 79 unique citations in the CCC abstracts, 159 in the RSQ abstracts, and 121 in the RR abstracts. Only six citations occur in both the RSQ and CCC abstracts corpora: Mina Shaughnessy, Kenneth Burke, John Dewey, Donald Davidson, Peter Elbow, and Mikhail Bakhtin. When factoring in RR, only Kenneth Burke, John Dewey, and Peter Elbow are shared across all three corpora. RR and RSQ share quite a few sources, almost all of which are historical figures: Plato, Aristotle, Cicero, Isocrates, and the like. Kenneth Burke is the most frequently cited source in each abstracts corpus: he is cited in 5 separate abstracts in CCC, 17 in RSQ, and 14 in RR. Maybe “rhetoric and composition” should be changed to “Burkean studies.” No surprise: the man has his own journal.
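The overlap between corpora is just set intersection. The sketch below uses partial name sets built only from names mentioned in this post, so it illustrates the comparison rather than reproducing the full spreadsheets; the variable names are my own.

```python
# Partial sets, limited to names mentioned in this post.
# The complete lists (79, 159, and 121 names) live in the spreadsheets.
ccc = {"Shaughnessy", "Burke", "Dewey", "Davidson", "Elbow", "Bakhtin",
       "Perelman", "Tolson"}
rsq = {"Shaughnessy", "Burke", "Dewey", "Davidson", "Elbow", "Bakhtin",
       "Plato", "Aristotle", "Cicero", "Quintilian", "Isocrates"}
rr = {"Burke", "Dewey", "Elbow",
      "Plato", "Aristotle", "Cicero", "Quintilian", "Isocrates"}

# Names cited in both the CCC and RSQ abstracts.
shared_ccc_rsq = ccc & rsq

# Names cited in all three corpora.
shared_all = ccc & rsq & rr

# Names shared by RSQ and RR that never surface in CCC.
shared_rsq_rr_only = (rsq & rr) - ccc

print(sorted(shared_ccc_rsq))
print(sorted(shared_all))
print(sorted(shared_rsq_rr_only))
```

On these partial sets, the last line returns only historical figures, which is the pattern described above for the sources RR and RSQ share.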

Based on the raw count of unique citations in each journal (on average, fewer than one per abstract), I think my original suggestion is at least partially correct: counting citations in abstracts controls for the rhetorical demand of articles to cite more sources than necessary. Abstract citations are the stars of the show. Nevertheless, after graphing the citations, Pareto distributions did emerge:

[Figure: CCC abstract citations]

[Figure: RSQ abstract citations]

[Figure: RR abstract citations]

Citations in the CCC abstracts occurred in a slightly more even distribution than citations in CCC articles (cf. Mueller). But then, there aren’t many citations in this corpus, relative to the RSQ and RR corpora. Among the citations that do appear, none occurs in many more abstracts than the sources cited only once. The most frequently cited source, Burke, occurs in five abstracts. Does this graph confirm Mueller’s conclusion about a dappled CCC? To some extent, yes. There’s still a long tail, after all . . .

RSQ citations even more obviously display the Pareto distribution discussed in Mueller’s article. The citations occurring most frequently—Burke and Plato—surface in 17 and 14 abstracts, respectively.

The distribution in RR is also uneven, and the drop of the long tail is even more precipitous than the one in RSQ. Burke is cited in 14 abstracts and the next most frequent source, Aristotle, is cited in 5 abstracts.

These graphs indicate that even in article abstracts, where only the most vital sources are invoked, a small canon of core scholars emerges beside an otherwise long, flat, dappled distribution of citations. More divergence and specialization, then, and not just in CCC but in RR and RSQ.
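The graphs are rank-frequency plots: sources sorted from most to least cited, with the count on the vertical axis. A rough sketch of how the ranking and the size of the tail could be computed is below. The function names and the toy counts are hypothetical and are not taken from the spreadsheets.

```python
from collections import Counter


def rank_frequency(counts):
    """Sort sources from most to least cited and attach a rank to each."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, name, n) for rank, (name, n) in enumerate(ordered, start=1)]


def tail_share(counts, max_count=1):
    """Fraction of unique sources cited in at most max_count abstracts."""
    if not counts:
        return 0.0
    in_tail = sum(1 for n in counts.values() if n <= max_count)
    return in_tail / len(counts)


# Hypothetical counts for illustration only, not the real data.
toy_counts = Counter({
    "A": 14, "B": 5, "C": 3, "D": 2,
    "E": 1, "F": 1, "G": 1, "H": 1, "I": 1, "J": 1,
})

ranked = rank_frequency(toy_counts)
for rank, name, n in ranked[:3]:
    print(rank, name, n)

print(f"share cited once: {tail_share(toy_counts):.0%}")
print(f"share cited once or twice: {tail_share(toy_counts, max_count=2):.0%}")
```

The ranked triples can be handed to any plotting tool, with rank on the horizontal axis and count on the vertical. The tail share puts a single number on how quickly the curve falls to one, which makes it easier to compare journals of different sizes.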

I think there’s more to it than disciplinary divergence, however. These long tails can undoubtedly be explained mathematically—the conclusion: they’re inevitable—but in this particular case they might also be explainable in prosaic terms. And I believe this prosaic explanation makes sense of the long tail in a way that salvages a shred of disciplinary unity within each journal:

In RR and RSQ, for example, an obvious citation pattern emerges. Five of the ten most cited sources in the RSQ abstracts are historical figures: Plato, Aristotle, Quintilian, Blair, and Cicero. In RR, the exact same thing: Aristotle, Cicero, Isocrates, Plato, Quintilian. And glancing through the long tail in both citation counts, historical figures continue to emerge, mostly from the Greco-Roman world, but from beyond it as well. In the CCC long tail, on the other hand, historical figures occur less frequently, and only two predate the 19th century.

Raw numbers for RR and RSQ: 27 (or 22%) of the RR citations are sources from the 17th century or earlier. 26 (or 16%) of RSQ citations are from the same period. Most are Greco-Roman sources, but Confucius, Montaigne, and Averroes are also scattered throughout the long tail. We might conclude, then, that a decently sized community of historians of rhetoric communicate in RSQ and RR (when they’re not communicating in Rhetorica, presumably). Their communication adds to the long tail, but does it signify disciplinary divergence and specialization?

Rather, here is one disciplinary community—historians of rhetoric—mapped out in unity. Its borders extend slightly into CCC but its principal territory lies in RSQ and RR. An obvious outcome, if you’re involved in the field. However, it also helps us make partial sense of that worrying Pareto distribution: not all of the singular citations that constitute the long tail are as disconnected as the graphs lead us to believe. In RSQ and RR, many singular citations could be grouped together: Plutarch, Laertius, Strabo, Aristophanes—these are, at least, not as indicative of a dappled disciplinary identity as, say, St. Paul and Steven Mailloux.

The same point can be made with pedagogy in the CCC abstracts. It is not surprising, of course, that CCC is home to scholars citing pedagogically inclined sources; however, for a second time, this obvious point helps make sense of the Pareto distribution of citations presented here and in Mueller’s article. Charles Pierce, Mina Shaughnessy, Melvin Tolson, and Les Perelman each appear only once, scattered throughout the long tail of abstract citations. But each is invoked for direct relevance to writing pedagogy. Viewed in this way, the flat distribution of citations seems a little less dappled.


6 thoughts on “Graphing Citations and Making Sense of Disciplinary Divisions”

  1. Pingback: Demographic distribution: Gender of citations in CCC, RSQ, and RR abstracts |

  2. I read this again. Nice job digging deeper into the tails to get a better interpretation. It’s certainly true that if everyone’s talking to the same group of dead people, the centrality of that group of dead people lends cohesion to the people talking to them, by extension. But I think the issue still remains of how fragmented the field is if moderns aren’t talking to one another — Plato isn’t collating and brokering what someone argued back at him in 2014, to other scholars in the field. Are scholars diverging off in their own directions with their own personal interpretations of old texts?

  3. Pingback: “Re-purposing Data” in the Digital Humanities |

  4. Isn’t it pretty well established that citations in a variety of disciplines follow a power law? I thought I read that toward the end of the Newman networks text. And I think power laws showcase the preferential attachment of the network — that is — scholars huddled around a small set of ideas. That would seem to be cohesion and unity. A perfectly uniform distribution of citations would indicate purely random citation. We should expect grouping and consensus (even if consensus around wrong ideas!) in academic disciplines.

    • Power laws exist just about everywhere you look, especially when it comes to anything related to language or writing (and citations in journals still fall in that category, I think). Zipf’s Law is obvious in almost every natural language corpus, no matter what language it comes from. I even found a Pareto distribution when compiling the health of Native American languages: https://technaverbascripta.wordpress.com/2013/01/25/the-pareto-distribubution-of-native-american-language-speakers/

      So, in one sense, yes, I agree with the traditional wisdom that would say that this indicates unity when it comes to citations: “scholars huddled around a small set of ideas,” like you say. But since power laws are just inherent in linguistic phenomena across the board, I’m willing to consider various arguments about their existence in any individual context.

      E.g., I think it’s also possible to argue that a more even distribution in this case might indicate, not randomness, but the fact that a source one scholar finds valuable, a significant number of others find valuable as well. What does it mean that 80+% of citations in a discipline are “throw away,” as it were? Maybe it has nothing to do with unity or disunity, but rather just scholarly probing that does or does not catch on.

      Also, in some journals, it looks like the “core” group of citations are, in terms of what kind of content the sources are likely addressing, actually pretty different if you look at them, although not entirely different—that’s what the last paragraph in the post was trying to say.

      Anyway, thanks for the value-added comment! Do you have any links to studies like this done in other disciplines?
