Some Quick Text Mining of the 2015 CCCC Program

During CCCC last week, Freddie deBoer made a couple comments about the conference: first, that there weren’t as many panels on the actual work of teaching writing compared to panels on sexier topics, like [insert stereotypical humanities topic here]; and second, that not much empirical research was being presented at the conference.

Testing these claims isn’t easy, but as a first stab, here’s a list of the most frequent unigrams and bigrams in the conference’s full list of presentation titles, as found in the official program. Make of these lists what you will. It’s pretty obvious to me that the conference wasn’t bursting at the seams with quantitative data. Sure, research appears at the head of the distribution, but I’ll leave it to you to concordance the word and figure out how often it denotes empirical research into writers while writing.

Then again, big data was a relatively popular term this year. It was used in titles more often than case studies, though case studies was used more often than digital humanities.

To Freddie’s point, the word empirical only appears 11 times in the CCCC program; the word essay appears only 16 times. Is it therefore fair to say there weren’t many empirical studies on essay writing presented this year? Maybe. Maybe not.
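A minimal sketch of how counts like these could be produced, for anyone who wants to rerun them on the program. The three titles below are made-up stand-ins rather than lines from the actual program, `tokenize` and `ngram_counts` are illustrative names, and the sketch assumes no stopword removal and no bigrams spanning two titles; the counts in this post may have been produced differently.

```python
import re
from collections import Counter

# Hypothetical stand-in titles; real input would be the program's title list.
titles = [
    "Big Data and the Writing Classroom",
    "Case Studies in Writing Research",
    "Big Data, Small Classrooms",
]

def tokenize(title):
    """Lowercase a title and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", title.lower())

def ngram_counts(titles, n):
    """Count n-grams inside each title (n-grams never span two titles)."""
    counts = Counter()
    for title in titles:
        tokens = tokenize(title)
        # zip over n shifted copies of the token list yields n-token tuples
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

unigrams = ngram_counts(titles, 1)
bigrams = ngram_counts(titles, 2)

print(unigrams.most_common(5))
print(bigrams.most_common(5))
```

Keys are tuples, so a single word is looked up as `unigrams[("empirical",)]` and a bigram as `bigrams[("big", "data")]`.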

CCCCUnigrams

CCCCBigrams

One way to get a flavor for the contexts and connotations of individual words and bigrams is of course to create a text network. I’ve begun to think of text networks as visual concordances.

Here is a text network of the tokens writing, write, writer, writers, writing_courses, classroom, and classrooms in the CCCC program. One thing to notice is that although these words are semantically related, in the panel and presentation titles they sit in clusters of relatively unrelated words. I had expected to discover a messy, overlapping network with these terms, but they’re rather distinct, as judged by the company they keep in the CCCC program. Even the singular and plural forms of the same noun (e.g., classroom and classrooms, writer and writers) form distinct clusters.
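To make "the company they keep" concrete, here is a rough sketch of a co-occurrence network built with the standard library only. It assumes an edge joins two content words whenever they share a title; the titles, the stopword list, and the helper names are all hypothetical, and the networks pictured in this post were not necessarily built this way.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical titles for illustration only.
titles = [
    "Identity and the Virtual Writer",
    "The Student Writer in FYC",
    "Writers, Risk, and Reward",
    "Writers Taking Risk Online",
]

# A tiny, assumed stopword list; a real analysis would use a fuller one.
STOPWORDS = {"and", "the", "in", "of", "a", "to"}

def build_network(titles):
    """Link every pair of content words that appear in the same title."""
    edges = defaultdict(int)
    for title in titles:
        words = {w for w in re.findall(r"[a-z]+", title.lower())
                 if w not in STOPWORDS}
        for a, b in combinations(sorted(words), 2):
            edges[(a, b)] += 1  # edge weight = number of shared titles
    return edges

def neighbors(edges, word):
    """Return the set of words that share at least one title with `word`."""
    found = set()
    for a, b in edges:
        if a == word:
            found.add(b)
        elif b == word:
            found.add(a)
    return found

def jaccard(s, t):
    """Overlap of two neighbor sets: 0.0 = disjoint, 1.0 = identical."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0

network = build_network(titles)
print(sorted(neighbors(network, "writer")))
print(sorted(neighbors(network, "writers")))
print(jaccard(neighbors(network, "writer"), neighbors(network, "writers")))
```

A Jaccard score near zero for a singular/plural pair is one simple way to quantify the kind of separation described above.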

CCCCProgramNetwork

In relation to Freddie’s point, this network demonstrates that words or bigrams that are prima facie good proxies for “teaching writing” often do lead us to presentations that are pedagogical in nature. However, just as often, they lead us to presentations that are only tangentially or not at all related to the teaching of writing and to the empirical study of writers while writing.

Thus, writer forms a cluster with FYC, student, and reader but also with identity, ownership, and virtual. The same thing occurs with the other terms, though writing by far occurs alongside the most diverse range of lexical items.

CCCCWriter

CCCCWriters

CCCCClassroom

CCCCWriting

This is about as much work as I’m interested in doing on the CCCC program for now. In my last post, I put a download link for a .doc version of the program, for anyone interested in doing a more thorough analysis, whether to test Freddie’s claims or to test your own ideas about the field’s zeitgeist.

However, it’s always important to keep in mind that a conference program might tell us more about the influence of conference themes than about the field itself.

ADDED: Here is a list of all names listed at the end of the CCCC program (CCCCProgramNames). Problem is, it’s a list of the FIRST and LAST names, with each given its own entry. If someone is inclined, they can go through this list and delete the last names, which will leave a file that can be run through a gender recognition algorithm to see what the gender split of CCCC presenters was.
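If the entries happen to alternate strictly (first name, last name, first name, last name), the deletion could be scripted rather than done by hand. That alternation is an assumption, not something verified against the file, and the names below are invented; multi-part names or missing entries would break the pattern, so the result should be spot-checked.

```python
def first_names(entries):
    """Keep every other entry, assuming a strict first/last alternation."""
    cleaned = [e.strip() for e in entries if e.strip()]  # drop blank lines
    return cleaned[0::2]  # positions 0, 2, 4, ... are taken to be first names

# Hypothetical sample in the assumed one-name-per-entry layout.
sample_entries = ["Maria", "Lopez", "", "James", "Chen", "Priya", "Nair"]
print(first_names(sample_entries))
```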

University representation at CCCC

Here’s a list of the universities and colleges best represented at the 2015 CCCC conference. I used NLTK to locate named entities in the CCCC program, so the graph simply represents a raw count of each time a university’s name appears in the program. Some counts might be inflated, but in general, each time a school is named = a panel with a representative from that school.

The graph shows only those schools that were named at least 10 times in the program (i.e., the schools that had at least 10 individual panels). Even in this truncated list, Michigan State dominates. Explanations for this gross inequality in representation are welcome in the comments.
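The count-and-threshold step can be sketched as follows. This example deliberately swaps NLTK's named-entity recognition for a crude regular expression so that it stays self-contained; the pattern, the sample text, and the function name are assumptions for illustration, and a regex like this will miss or mangle many real institution names.

```python
import re
from collections import Counter

# Crude stand-in for named-entity recognition: matches "University of X"
# or "X Y University" / "X College". Not the NLTK approach used for the graph.
SCHOOL = re.compile(
    r"University of [A-Z][a-z]+(?: [A-Z][a-z]+)*"
    r"|(?:[A-Z][a-z]+ )+(?:University|College)"
)

def school_counts(text, min_count=10):
    """Count school mentions and keep only those named at least min_count times."""
    counts = Counter(m.group(0) for m in SCHOOL.finditer(text))
    return {name: n for name, n in counts.most_common() if n >= min_count}

# Hypothetical program excerpt.
sample = (
    "Jane Doe, Michigan State University\n"
    "John Roe, University of Texas\n"
    "Ann Poe, Michigan State University\n"
)

print(school_counts(sample, min_count=2))
```

The default `min_count=10` mirrors the cutoff used for the graph; the sample call lowers it only because the excerpt is tiny.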

CCCCColleges

Program (in .docx form because WordPress doesn’t allow .txt files)

All Your Data Are Belong To Us

In the blink of an eye, sci-fi dystopia becomes reality becomes the reality we take for granted becomes the legally enshrined status quo:

“One of our top priorities in Congress must be to promote the sharing of cyber threat data among the private sector and the federal government to defend against cyberattacks and encourage better coordination,” said Carper, ranking member of the Senate Homeland Security and Governmental Affairs Committee.

Of course, the pols are promising that data analyzed by the state will remain nameless:

The measure — known as the Cyber Threat Intelligence Sharing Act — would give companies legal liability protections when sharing cyber threat data with the DHS’s cyber info hub, known as the National Cybersecurity and Communications Integration Center (NCCIC). Companies would have to make “reasonable efforts” to remove personally identifiable information before sharing any data.

The bill also lays out a rubric for how the NCCIC can share that data with other federal agencies, requiring it to minimize identifying information and limiting government uses for the data. Transparency reports and a five-year sunset clause would attempt to ensure the program maintains its civil liberties protections and effectiveness.

Obama seems to suggest that third-party “cyber-info hubs”—some strange vivisection of private and public power—will be in charge of de-personalizing data in between Facebook and the NSA or DHS:

These industry organizations, known as Information Sharing and Analysis Organizations (ISAOs), don’t yet exist, and the White House’s legislative proposal was short on details. It left some wondering what exactly the administration was suggesting.

In the executive order coming Friday, the White House will clarify that it envisions ISAOs as membership organizations or single companies “that share information across a region or in response to a specific emerging cyber threat,” the administration said.

Already existing industry-specific cyber info hubs can qualify as ISAOs, but will be encouraged to adopt a set of voluntary security and privacy protocols that would apply to all such information-sharing centers. The executive order will direct DHS to create those protocols for all ISAOs.

These protocols will let companies “look at [an ISAO] and make judgments about whether those are good organizations and will be beneficial to them and also protect their information properly,” Daniel said.

In theory, separating powers or multiplying agencies accords with the vision of the men who wrote the Federalist Papers, the idea being to make power so diffuse that no individual, branch, or agency can do much harm on its own. However, as Yogi Berra said, “In theory there is no difference between theory and practice, but in practice there is.” Mark Zuckerberg and a few other CEOs know the difference, too. They decided not to attend Obama’s “cyber defense” summit in Silicon Valley last week.

The attacks on Target, Sony, and Home Depot (the attacks invoked by the state to prove the need for more state oversight) are criminal matters, to be sure, and since private companies can’t arrest people, the state will need to get involved somehow. But theft in the private sector is not a new thing. When a Target store is robbed, someone calls the police. No one suggests that every Target in the nation should have its own dedicated police officer monitoring the store 24/7. So why does the state need a massive data sharing program with the private sector? It’s the digital equivalent of putting police officers in every aisle of every Target store in the nation—which is likely the whole point.

Target, of course, does monitor every aisle in each of its stores 24/7. But this is a private, internal decision, and the information captured by closed circuit cameras is shared with the state only after a crime has been committed. There is no room of men watching these tapes, no IT army paid to track Target movements on a massive scale, to determine who is a possible threat, to mark and file away even the smallest infraction on the chance that it is needed to make a case against someone at a later date.

What Obama and the DHS are suggesting is that the state should do exactly that: to enter every private digital space and erect its own closed circuit cameras, so that men in suits can monitor movement in these spaces whether a crime has been committed or not. (State agencies are already doing it, of course, but now the Obama Administration is attempting to increase the state’s reach and to enshrine the practice in law.)

“As long as you aren’t doing anything wrong, what do you care?”

In the short term, that’s a practical answer. In the future, however, a state-run system of closed circuit cameras watching digital space 24/7 may not always be used for justified criminal prosecution.

The next great technological revolution, in my view, will be the creation of an entirely new internet protocol suite that enables some semblance of truly “invisible” networking, or perhaps the widespread adoption of personal cloud computing. The idea will be to exit the glare of the watchers.

Hindi 101

I’m taking Hindi 101 this semester. The Devanagari script feels mildly ornate in my hand compared to the angularity of alphabets descended from the Phoenician script (including the English alphabet), but it is quite lovely and not as challenging as I had imagined. It is still an alphabet, after all, with a much closer sound-grapheme correspondence than one finds in English, where each letter, particularly the vowels, can correspond to multiple phonemes. (English grammar is absurdly simple compared to all other major languages, but our spelling system must be a nightmare for foreign learners. There’s something to be said for language academies that control the drift between pronunciation and spelling.) Devanagari does, however, omit some vowel sounds and uses secondary or “dependent” vowel forms in most contexts, so it has something of the syllabary about it. In fact, the biggest mistake I make in class is to confuse two dependent vowels,  ी and  ो. The former is long “ee”, the latter is “o”, but in certain fonts (including my own handwriting), they look nearly identical.

The script’s biconsonantal conjuncts are mostly intuitive, though a few bizarre ones need to be memorized as separate graphemes. We have conjuncts in English, but I believe they are a relatively new innovation with limited usage. One example is the city logo of Huntington Beach, California. Hindi has a lot of these, and they are quite common.

clip_image002_0001.201144028_std

An English biconsonantal conjunct.

Apart from learning a new script, the most enjoyable part of Hindi class has been coming across Romance or Germanic cognates. At an intellectual level, I know and have long known that Hindi and English, both Indo-European languages, share a genetic ancestry, which means that at some point in the distant past all Indo-European speakers spoke the same language. It’s easy to get a handle on the concept when talking about Romance languages: Spanish, Italian, and French all used to be Latin. There, we have a well documented history, stretching back through the Renaissance and Middle Ages to the familiar world of Rome. However, when it comes to Proto Indo-European, we are faced with a deeper and wider canyon of time and an ancient world that is mostly unknown to us. The PIE speakers were probably living in the Pontic-Caspian steppe lands, but some evidence suggests that they may have been living in the greater Anatolian region; perhaps the most direct descendants of Proto Indo-Europeans are today’s Armenians, Turks, and Persians. They apparently kicked ass and took names because Indo-European now stretches from the Pacific to the Indian Oceans.

But whoever they were, the PIE speakers are remote in a way that the Romans or Germanic tribes are not. Yet while doing my Hindi homework, every now and again I come across a word that clearly indicates the ancient linguistic (and genetic) connectedness between the Romans, the Germans, and the Hindi speakers. Kamiz for shirt; mez for table; kamra for room; mata for mother; pita for father; nam for name; darvaza for door . . . In Hindi class, when I say a word out loud that is clearly related to a European word, I am intoning sounds close to the ones that came from the lips of those ancient Indo-Europeans before they split eastward and westward to conquer Eurasia. To language nerds like me, it’s a chilling sensation.

Distorting time to deny inevitability

The latest issue of Rhetoric Society Quarterly has its authors engaging with “untimely historiography,” which, as near as I can tell, is an attempt to complicate the notion of time as a one-way river of cause and effect. Most of the essays (I’ve read two and skimmed the others) seem to share a common distrust of grand narratives and a distaste for histories that look beyond the contingency of particular events. Cause and effect, linear time: these are human constructs that make sense of, or rather distort, an otherwise irreducibly complex mess of events.

The chronological anxiety in these essays is of the sort recently addressed by Ted Underwood in Why Literary Periods Mattered. There is of course good reason to be skeptical about grand narratives and historical theories, so I’m sympathetic to much of what is said in these new essays, and I find value in taking a critical look at constructions of linearity in history. However, as genetics blogger Razib Khan notes, acknowledging the dangers of over-generalization presents us with “problems to be grappled with, not a ‘get out of jail’ card to be thrown at any attempts to construct a formal system of interpretation.” Khan’s post is aptly entitled “Human History is Both Contingent and Inevitable,” and I think this both/and worldview is intellectually useful. It makes room for the radical contingency argued for by Michelle Ballif and others without foreclosing on legitimate linear interpretations of history. Thinking about history as both contingent and inevitable leads us to ask where it’s one or the other, to disentangle where it’s more one than the other.

Not everyone would agree with my sentiment, to put it mildly. As an example, I’ll quote from Hans Kellner’s essay “Is History Ever Timely?”*, in which he recounts a talk given by Hayden White:

In 1967, Hayden White . . . journeyed to Colorado to deliver a talk at a conference on biology. At this conference he spoke on the topic “What is a Historical System?” in which he contrasted a historical system with a biological system. In effect, he said that biological—that is, genetic—systems are timely. By this he meant that one’s biological state had been determined in the past by genetic ancestral code. Today we would speak of DNA. But is this true of historical, cultural ancestry? Are we historically determined in the matter of who we are? Is our historical identity as fixed by the timeliness of time and genetic logic as our biological identity is? At that conference, White said, “no.”

A resounding answer, one that, I believe, many scholars in the humanities would echo. It also rejects my olive branch to both sides of the question. It implicitly denies the possibility that culture and history might exhibit large-scale patterns or processes due to the influence of biology, geography, demographics, economics, and so on.

Kellner continues with an example that White used to prove his point: the Christianization of Europe as a culturally created event that needn’t have occurred:

Cultural communities are constituted on the basis of a shared agreement about the choice of historical ancestors. There are times, however, when people lose faith in their chosen identities . . . The example White cited at the time was the crisis of the seventh and eighth centuries in Northern Europe, when a Romanized world saw that the source of their identity had been changed beyond recognition, and a new candidate for that identity had emerged in the teachings of Christian missionaries. As White put it, when the Germanic peoples of northern Europe decided that they were no longer the cultural descendants of ancient Romans or of pagan barbarians, and that their cultural ancestors were Palestinian Jews with whom they had no biological connection at all, a new culture was formed. Backwards. This did not need to happen. Just as the pin on which one sat might have never been noticed if the pain had not caused it to exist for us, so the “Christianization” might have never happened . . .

But is it true that Northern Europe switched identities and cultures as effortlessly as Kellner’s gloss implies? It seems to me a highly contested statement. The Holy Roman Empire was a hegemon among Europe’s warring monarchs and tribes for a time, and, as White describes, the Church Fathers went to great lengths to adopt for themselves and for Europe a foreign Jewish culture and history, but to suggest that the Scots, the Anglos, the Franks, and the Iberians stopped being Scots, Anglos, Franks, and Iberians just because they became Christian is a gross overstatement belied by the constant warfare and power-plays that constitute European history (you’d think White and Kellner would be more careful about hasty generalizations!). It’s like saying the Persians stopped being Persian when they were conquered by the Muslims. Culture runs deep, precisely, I think, because it is tied to and influenced by processes much more intransigent than individual human whim. I don’t believe culture is a costume ready to be changed in a generation or two, and any attempts to do so often result in backlashes or corrections. One might even argue that during the middle ages Europe was just waiting for its monarchs to re-assert their power over Rome so they could all go back to fighting one another again. And indeed they did.

Now, I’m sympathetic to the political sensibility from which I think all this emerges—the idea that if history is not inevitable then the future is, to some extent, in our hands, ready to be constructed in a more just and moral way. On the other hand, if the movement of history is inevitable, then humans can have no agency over their (often unjust) cultures and behaviors, no more agency than they have over their genetics. Such is the “Cormac McCarthy” view of the world, McCarthy having famously said that wishing the species could be “improved in some way . . . will make your life vacuous.” It is an antipathy to this view that brings out the poststructuralist and postmodern tendencies in these RSQ essays, whose authors deny inevitability to history by denying the linear shape of time altogether. Get rid of linear time and any notion of inevitability disappears with it.

I grew up watching wildlife documentaries, so I was inured from a young age to the McCarthy view. It probably didn’t help that I read Blood Meridian in tenth grade. Nevertheless, I try not to err in extremes, so although my default position on culture is determinism of all types—genetic, geographic, demographic, historical—I enjoy challenging and often replacing my default assumptions. I think those who err on the other side—no determinism of any type, history is always contingent—should likewise challenge their default assumption. Hopefully we can meet in the middle.

Hayden White asked: Are we historically determined in the matter of who we are? Is our historical identity as fixed by the timeliness of time and genetic logic as our biological identity is? He answered no, but I think we should answer, Sometimes yes and sometimes no. It depends on what you’re talking about. The intellectual challenge is to figure out what is (or was) contingent and what is (or was) inevitable. Does history exhibit patterns and cycles? What are the large-scale processes which stand outside of but influence cultural expressions? Do certain cultural expressions change according to broadly identifiable patterns, while others exhibit no patterned changes whatsoever? How do irreducibly contingent moments interact with larger historical processes? Interesting questions, in my opinion, ones that the cliodynamicists are trying to answer mathematically. Will they be successful? Maybe, maybe not. But before the fact, I don’t think we should, to quote Khan again, “throw our hands up in the air and assume that all of history is a contingent darkness from which we can’t infer general patterns.”

 

*Kellner’s essay is a sensible discussion of the ways that texts, films, and images create connections across great gaps of time to re-figure the past in terms of the present. It’s an excellent piece, and I’m simply using these carefully extracted quotes as a foil.

Elliot Rodger’s Manifesto: Text Networks and Corpus Features

Analyzing manifestos is becoming a theme at this blog. Click here for Chris Dorner’s manifesto and here for the Unabomber manifesto.

Manifestos are interesting because they are the most deliberately written and deliberately personal of genres. It’s tenuous to make claims about a person’s psyche based on the linguistic features of his personal emails; it’s far less tenuous to make claims about a person’s psyche based on the linguistic features of his manifesto—especially one written right before he goes on a kill rampage. This one—“My Twisted World,” written by omega male Elliot Rodger—is 140 pages long, and is part manifesto, part autobiography.

I’ve made a lot of text networks over the years—of manifestos, of novels, of poems. Never before have I seen such a long text exhibit this kind of stark, binary division:

RodgersBetweennessCentrality

This network visualizes the nodes with the highest betweenness centrality. The lower, light blue cluster is Elliot’s domestic language; this is where you’ll find words like “friends”, “school,” “house,” et cetera . . . words describing his life in general. The higher, red cluster is Elliot’s sexually frustrated language; this is where you’ll find words like “girls,” “women,” “sex,” “experience,” “beautiful,” “never” . . . words describing his relationships (or lack thereof) with the feminine half of our species.
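For readers unfamiliar with the measure: a node's betweenness centrality counts how many shortest paths between other nodes run through it, so words that bridge two clusters score highest. Below is a compact sketch using Brandes' algorithm on an unweighted, undirected graph. It is a standard way to compute the measure, not necessarily what the software behind the figure does, and the toy graph (two triangles of words joined through "me") is invented for illustration.

```python
from collections import deque

def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph given as
    {node: set_of_neighbors}. Returns unnormalized betweenness scores."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        stack = []                          # nodes in order of discovery
        preds = {v: [] for v in graph}      # predecessors on shortest paths
        sigma = {v: 0 for v in graph}       # number of shortest paths from s
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                        # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:             # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:                        # accumulate from farthest back to s
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint, so halve.
    return {v: c / 2 for v, c in score.items()}

def from_edges(edges):
    """Build a symmetric adjacency dict from a list of (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

# Hypothetical toy network: two tight clusters bridged by "me".
graph = from_edges([
    ("friends", "school"), ("friends", "house"), ("school", "house"),
    ("girls", "women"), ("girls", "sex"), ("women", "sex"),
    ("house", "me"), ("me", "girls"),
])
scores = betweenness(graph)
for word, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(word, value)
```

In this toy graph every path between the two clusters has to pass through "me", which is why it ranks first.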

It’s quite startling. Although this text is part manifesto and part autobiography, I wasn’t expecting such a clear division: the language Elliot uses to describe his sexually frustrated life is almost wholly severed from the language he uses to describe his life apart from the sex and the “girls” (Elliot uses “girls” far more frequently than he uses “women”—see below). It’s as though Elliot had completely compartmentalized his sexual frustration, and was keeping it at bay. Or trying to. I don’t know how this plays out in individual sections of the manifesto. Nor do I know what it says about Elliot’s mental health more generally. I’ve always believed that compartmentalizing frustrations is, contra popular advice, a rather healthy thing to do. I expected a very, very tortuous and conflicted network to emerge here, indicating that each aspect of Elliot’s life was dripping with sexual angst and misogyny. Not so, it turns out.

Here’s a brief “zoom” on each section:

RodgersDegreeCentralityDomestic

RodgersDegreeCentralityWomen

In the large, zoomed-out network (the first one in the post), notice that the most central nodes are “me” and “my.” I processed the text using AutoMap but decided to retain the pronouns, curious how the feminine, masculine, and personal pronouns would play out in the networks and the dispersion plots. Feminine, masculine, personal: not just pronouns in this particular text. And what emerges when the pronouns are retained is an obvious image of the Personal. Rodger’s manifesto is brimming with self-reference:

RodgersPronouns

Take that with a grain of salt, of course. In making claims about any text with these methods, one should compare features with the features of general text corpora and with texts of a similar type. The Brown Corpus provides some perspective: “It” is the most frequent pronoun in that corpus; “I” is second; “me” is far down the list, past the third-person pronouns.

Here’s another narcissistic twist, found in the most frequent words in the text. Again, pronouns have been retained. (Click to enlarge.)

RodgersFreqWords

“I” is the most frequent word in the entire text, coming before even the basic functional workhorses of the English language. The Brown Corpus once more provides perspective: “I” is the 11th most frequent word in that general corpus. Of course, as noted, there is an auto-biographic ethos to this manifesto, so it would be worth checking whether or not other auto-biographies bump “I” to the number one spot. Perhaps. But I would be surprised if “I,” “me,” and “my” all clustered in the top 10 in a typical auto-biography—a narcissistic genre by design, yet I imagine that self-aware authors attempt to balance the “I” with a pro-social dose of “thou.” Maybe I’m wrong. It would be worth checking.
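The check suggested here (where do “I,” “me,” and “my” rank in a given text?) is easy to script and then run on any other autobiography for comparison. A minimal sketch follows; the one-sentence sample is invented, the first-person word list is an assumption, and the function names are illustrative.

```python
import re
from collections import Counter

FIRST_PERSON = {"i", "me", "my", "mine", "myself"}  # assumed pronoun set

def word_counts(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def word_ranks(counts):
    """Map each word to its frequency rank (1 = most frequent)."""
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def first_person_share(counts):
    """Fraction of all tokens that are first-person singular pronouns."""
    total = sum(counts.values())
    return sum(counts[w] for w in FIRST_PERSON) / total if total else 0.0

# Hypothetical sample sentence, not taken from any real text.
sample = "I went home and I told my mother that I was tired of my day"
counts = word_counts(sample)
ranks = word_ranks(counts)

print(ranks["i"], ranks["my"])
print(round(first_person_share(counts), 3))
```

Running the same functions over a reference corpus and a handful of memoirs would show whether a number-one rank for “I” is unusual for the genre.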

More lexical dispersion plots . . .

Much more negation is seen below than is typically found in texts. According to Michael Halliday, most text corpora will exhibit 10% negative polarity and 90% positive polarity. Elliot’s manifesto, however, bursts with negation. Also notice, below, the constant references to “mother” and “father”: his parents are central characters. But not “mom” and “dad.” I’m from Southern California, born and raised, with social experience across the races and classes, but I’ve never heard a single English-only speaker refer to parents as “mother” and “father” instead of “mom” and “dad.” Was Elliot bilingual? Finally, note that Elliot prefers “girl/s” to “woman/en.”
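Both ideas in this section can be approximated in a few lines. A dispersion plot is just the list of positions at which a word occurs, and a rough negation rate can be taken as the share of sentences containing a negator. The sketch below is a crude proxy: it works on sentences rather than the clauses that polarity is properly measured over, and the negator list, the sample text, and the function names are assumptions.

```python
import re

# Assumed list of negators; contractions ending in n't are caught separately.
NEGATORS = {"not", "no", "never", "nothing", "nobody", "none", "neither", "nor"}

def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def dispersion(text, word):
    """Relative positions (0.0 = start, 1.0 = end) where `word` occurs."""
    toks = tokens(text)
    return [i / len(toks) for i, t in enumerate(toks) if t == word]

def negated_sentence_share(text):
    """Share of sentences containing at least one negator (a rough proxy)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    negated = sum(
        1 for s in sentences
        if any(t in NEGATORS or t.endswith("n't") for t in tokens(s))
    )
    return negated / len(sentences)

# Hypothetical sample text for illustration.
sample = "I never went out. My mother called. I didn't answer. My father was not home."

print(dispersion(sample, "mother"))
print(negated_sentence_share(sample))
```

Plotting the `dispersion` positions for several words as tick marks on parallel lines reproduces the basic layout of a lexical dispersion plot.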

RodgersGirlsGuys

RodgersMotherFather

RodgersNegation

RodgersSexEtc

Until I discover that auto-biographical texts always drip with personal pronouns, I would argue that Elliot’s manifesto is the product of an especially narcissistic personality. The boy couldn’t go two sentences without referencing himself in some way.

And what about the misogyny? He uses masculine pronouns as often as he uses feminine pronouns; he refers to his father as often as he refers to his mother—although, it is true, the references to mother become more frequent, relative to father, as Elliot pushes toward his misogynistic climax. Overall, however, the rhetorical energy in the text is not expended on females in particular. This is not an anti-woman screed from beginning to end. Also, recall, the preferred term is “girls,” not “women.” Elliot hated girls. Women—middle-aged, old, married, ensconced in careers, not apt to wear bikinis on the Santa Barbara beach—are hardly on Elliot’s radar. (This ageism also comes through in his YouTube videos.) Despite the “I hate all women” rhetorical flourishes at the very beginning and the very end of his manifesto, Elliot prefers to write about girls—young, blonde, unmarried, pre-career, in sororities, apt to wear bikinis on the Santa Barbara beach.

I noticed something similar in the Unabomber manifesto. Not about the girls. About the beginning and ending: what we remember most from that manifesto is its anti-PC bookends, even though the bulk of the manifesto devotes itself to very different subject matter. The quotes pulled from manifestos (including this one) and published by news outlets are a few subjective anecdotes, not the totality of the text.

Anyway. Pieces of writing that sally forth from such diseased individuals always call to mind what Kenneth Burke said about Mein Kampf:

[Hitler] was helpful enough to put his cards face up on the table, that we might examine his hands. Let us, then, for God’s sake, examine them.

 

Demographic distribution: Gender of citations in CCC, RSQ, and RR abstracts

This post follows up on my discussion of citation frequencies in abstracts in rhetoric and composition journals. To reiterate, a safe assumption to make is that citations in abstracts are “central” to the arguments presented and the research undertaken in the articles themselves; they are particularly informative about overall trends. The genre of the humanities article demands more citations than a core argument actually requires, so looking at citations in abstracts should control for that genre requirement, distilling down all citations to the most vital ones.

The journals: College Composition and Communication (CCC), Rhetoric Society Quarterly (RSQ), and Rhetoric Review (RR). The CCC abstracts run from February 2000 (51.3) to September 2011 (63.1), a total of 261 abstracts. The RSQ abstracts run from Winter 2000 (30.1) to Fall 2011 (41.5), a total of 220 abstracts. The RR abstracts run from 2002 (21.3) to 2011 (30.4), a total of 154 abstracts.

The previous post discussed the “long tail” distribution that emerged from the citation frequencies and what it means for disciplinary identity. This post presents information on the gender of the sources cited in the abstracts, then makes a few comments about demographic distributions in general.

There are 79 unique citations in the CCC abstracts; 159 unique citations in the RSQ abstracts; and 121 unique citations in the RR abstracts. (See previous post for .xls data files.) Here’s how the gender distribution falls: in CCC, 23 out of the 79 sources are female; in RSQ, 39 out of the 159 sources are female; in RR, 36 out of the 121 sources are female.
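As a quick arithmetic check on those counts, the shares can be computed directly. The figures are the ones stated above; only the variable names are new.

```python
# (female sources, total unique sources) per journal, as reported above.
citations = {"CCC": (23, 79), "RSQ": (39, 159), "RR": (36, 121)}

shares = {journal: female / total
          for journal, (female, total) in citations.items()}

for journal, share in shares.items():
    print(f"{journal}: {share:.1%} female, {1 - share:.1%} male")
```

That works out to roughly 29% female in CCC, 25% in RSQ, and 30% in RR.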

And here are graphs of the raw counts and of the percentages:

Abstract citations by gender (raw count)


Abstract citations by gender (percentage)


In Authoring a Discipline, Maureen Daly Goggin has shown that by 1990 total contributors to 9 of rhetoric and composition’s major journals—including the 3 analyzed here—had equalized to a nearly 50/50 split between males and females. I imagine this trend has continued into the new millennium, but it would be worthwhile to determine whether or not that’s the case.

What has not equalized, however, is the gender contribution in terms of citations. Odds are, counting all citations in the articles themselves would alleviate the large gap seen in the graphs above. But insofar as we accept that abstract citations represent the most vital sources in each journal, then an obvious gender gap still exists in CCC, RSQ, and RR citations.

In RSQ and RR, this gap, in part, likely has something to do with these journals’ tendencies to publish work on rhetorical history. I pointed this out in the last post: 27 (or 22%) of the RR citations are sources from the 17th century or earlier. 26 (or 16%) of RSQ citations are from the same period. Those numbers would grow if they included figures from the 18th and 19th centuries, as well. The reality is, most of these historical sources are male: Plato, Cicero, Aristotle, Quintilian, et cetera.

I have no ready explanation for why CCC citations should have as large a gender gap as the other journals’ citations, given that CCC builds most of its scholarship on sources from the middle part of the 20th century or later. If we look at the 102 most cited figures in CCC between 1987 and 2011 (Mueller, “Grasping”), we discover that 43/102 (42%) of the sources are female: a gender imbalance, but one not nearly as pronounced as the imbalance that surfaces in abstract citations. I’d be curious to see the gender distribution in Mueller’s entire data set. Is there a nearly 50/50 split between male and female sources across all citations in CCC between 1987 and 2011? If so, we could model the gender imbalance in this journal’s citations as an emergent feature: 50/50 across the entire data set; 58/42 in the most popular citations between 1987 and 2011; 71/29 in abstracts between 2000 and 2011. It’s unfortunate that CCC did not publish abstracts until the late 1990s, so that the dates of the abstracts and the articles could be uniform.

The question of demographic balance is one that spills a lot of digital ink. Just this morning, Scott Weingart visualized the gender (im)balance of Digital Humanities Conference attendees: about a 70/30 split that favors males. And Google recently released the demographic characteristics of its workforce: 30% of its employees are women; 17% of its technical employees are women. 60% of its employees are white; 30% of its employees are Asian (read: East Asian and Indian); and only 3% of its employees are Non-Asian Minorities.

I asked Scott why our default assumption should be uniform demographic distribution. When looking at statistical trends that emerge at large scales, we shouldn’t be surprised to discover that human populations cluster differently. At least, that’s my default assumption. The DH Conference draws more males, but then, an Early Childhood Education conference will draw more females. (I once attended a conference on speech and behavior therapy for autistic children; there were no more than three or four males amid about seventy females.) Or take a look at the National Association for the Education of Young Children. Although we often hear about the male-ness of executive boards, the NAEYC’s executive team is entirely female, and its 17-member governing board boasts 13 females and only 4 males. Looking at all the Early Childhood Education associations and organizations in the country, what gender trends would we expect to find?

The first question to ask about demographic distribution in any particular population (like Google’s workforce or citations in abstracts) is this: What are the characteristics of the larger population from which this particular population is drawing? As long as rhetorical scholars continue to look at rhetorical history, where most of the figures are male, then we can continue to expect many citations in these historical journals to be male. (This may change, however, as more and more rhetorical historians re-discover the history of female oratory.) Or, in Google’s case, if we take the American population as the baseline, assuming a 50/50 gender split, then clearly there is a gender imbalance. But in terms of race and ethnicity, its white workforce is in fact under-represented. Raising the percentage of blacks and Hispanics at Google would mean firing a lot of the Chinese and Indians, unless we want to make whites more under-represented than they already are. (A fairer baseline population would be the percentage of working-age adults in America, or, better yet, the percentage of working-age adults with college degrees; however, those stats are much harder to come by. Total population is a decent but imperfect proxy.)
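One simple way to express "over-" or "under-represented" against a baseline is a ratio of the observed share to the baseline share, where 1.0 means parity. The sketch below applies it only to the gender figures, since the text supplies both the observed shares (30% and 17%) and the assumed 50/50 baseline; the function name is illustrative, and choosing a different baseline population would change the result.

```python
def representation_ratio(observed_share, baseline_share):
    """Observed share divided by baseline share; 1.0 means parity,
    below 1.0 under-representation, above 1.0 over-representation."""
    if baseline_share <= 0:
        raise ValueError("baseline_share must be positive")
    return observed_share / baseline_share

BASELINE_WOMEN = 0.50  # the 50/50 split assumed in the text

all_employees = representation_ratio(0.30, BASELINE_WOMEN)
technical_employees = representation_ratio(0.17, BASELINE_WOMEN)

print(f"all employees: {all_employees:.2f}")
print(f"technical employees: {technical_employees:.2f}")
```

The same function works for any group once a defensible baseline share is available, which is exactly the hard part noted above.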

The point is that we do not always find particular populations boasting a uniform or near-uniform demographic distribution. Why is this? A complex question. Given the totality of the human population (or, more humbly, the totality of any total population in a given geographic area), why do we find the smaller population clusters clustering the way they do around different practices? Why are there more males in CCC citations? Why are there more males at the DH Conference? Why are East Asians and Indians so over-represented at Google? Why are there so few East Asians and Indians in the NFL and the NBA? That populations cluster differently around different practices seems to be a statistical fact. Is it also a future inevitability?