A Distant Reading of my Autosomes

It’s amazing what spit in a vial can tell you.

When I mailed off the AncestryDNA kit, I figured I already knew the results, barring any family-shattering revelations. (One student I had at Syracuse, a Methodist, told me he turned out to be damn near half Jewish.) I, on the other hand, knew what the thousand-foot view should look like, if not the granular details.

My father has green eyes and sandy hair and, as far as family lore goes, is a British Isles mongrel. My mother is not two full generations out of Mexico; both her paternal and maternal sides are 100% Mexican. On average, Mexicans exhibit a 60/40 ancestral split between Europe and indigenous America (per Analabha Basu et al. 2008). So, assuming my mother = 60/40 Spaniard/Amerind split, and assuming my father = 100% Northern European, I assumed, to a rough approximation, that I’d be an 80/20 Euro/Amerind mix. (In the parlance of Oklahomans, I assumed I’d be “1/5 Cherokee”.)
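The back-of-the-envelope arithmetic here can be sketched in a few lines of Python. This is only an illustration: it assumes a child inherits exactly half of each parent's autosomal ancestry, which is an average expectation and not a guarantee, and the function name and dictionary labels are made up for the example.

```python
def expected_child_mix(mother, father):
    """Average two parents' ancestry proportions, region by region."""
    regions = set(mother) | set(father)
    return {r: (mother.get(r, 0.0) + father.get(r, 0.0)) / 2 for r in regions}

# Assumed inputs: the population-average split for the mother,
# and family lore for the father.
mother = {"European": 0.60, "Amerind": 0.40}
father = {"European": 1.00}

child = expected_child_mix(mother, father)
for region in sorted(child):
    print(region, round(child[region], 2))
```

Under those assumptions the expected mix comes out at roughly 80/20, which is the guess made above.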

AncestryDNA returned no major surprises. I’m 80% European, 16-19% Amerind. 



Mostly Irish, some Scandinavian, trace amounts of British/W. European.

Quite surprised at how Irish the results say I am. I assumed plenty of English, plus a wonderfully American smattering of other Northern, Western, and Central European ancestral regions. I assumed wrongly.


23andMe does not differentiate between English and Irish, but according to AncestryDNA’s white paper, their reference panel is thus differentiated. Here’s the PCA for their European reference panel.


The dark blue/orange cluster is the Irish/English cluster. They’re damn close but distinct enough. In fact, according to the same white paper, Ancestry’s methods have an 80+% accuracy rate for correctly putting the Irish in Ireland (their methods are actually not great at differentiating England and Western Europe).


Note: In the bar graph above, the shorter the bar, the better the predictive accuracy.

Interesting genetic detail. Irish! In my opinion, however, AncestryDNA’s “Irish” ancestral region should be labeled the “Celtic” ancestral region, for it also includes Scotland, Wales, and the borderlands.


So, my father is almost entirely of Celtic stock. I guess that makes me a halfbreed Celt. I had expected a non-trivial amount of English and W. Euro ancestry, but I have only trace amounts of it.


As far as the 15% Scandinavian: According to AncestryDNA, “Scandinavian” shows up in a lot of British Islanders, and I’m no exception. It’s not “Viking” DNA necessarily, but it could be pretty deep ancestry that can’t be traced back “genealogically.” AFAIK, I have no recent Scandinavian surnames in my paternal lineage.


Southern European (Spanish/Italian) and Amerindian. What else did you expect from a Mexican?

Surprised to see the Southern European ancestry isn’t a neat Iberian chunk. It’s split all along the Mediterranean, from Spain to Greece. AncestryDNA isn’t great at figuring out Southern European ancestry, particularly from the Iberian peninsula (see the bar graph accuracy rates above—only 50% for Iberia!).


16-19% confidence range for the Native American ancestry. There are tools to locate that ancestry into “tribal” regions, so it’ll be interesting to determine in the coming months if it’s Mayan or Aztecan. Almost certainly the latter, given the region in which my recent-ish maternal ancestors lived:


Central Mexico. Aguascalientes and Zacatecas. Too far north for the Mayans. Zacatecas comes from the Aztec word zacatl. Doesn’t mean my native ancestors were Aztec. That would be bad ass, but they just as likely could have been from the less civilized nomadic tribes the Aztecs called chichimeca, or barbarians.

The fact that my recent-ish Mexican ancestors are from Zacatecas and Aguascalientes fits well with my family history. Like many third and fourth generation Mexicans, my greats and grandparents came over in the 1910s and 1920s, during the Mexican Revolution. And, indeed, Zacatecas/Aguascalientes was the site of some of the Revolution’s most brutal fighting and was thus a prime source of origin for early 20th century Mexican immigrants.

Also, according to Wikipedia, San Luis Potosi, which is right next door to Zacatecas, is home to a non-trivial number of Italians. This might explain why my Southern Euro ancestry has an Italian component. Cross-state dalliances.

On the same Italian note, Ancestry provides you with cousin matches out to the sixth degree, for people who have also taken the test and appear to be related to you. Popping up in my matches are several people from the Italo-Mexican region of San Luis Potosi! Fourth and fifth cousins, extremely high probability. Only one of them has a picture; I won’t post it here, but she looks very Italian, not at all mestizo.

There’s some African and Middle Eastern noise, which, if legit, certainly comes from my mother’s side.


On average (again, per Analabha Basu et al.), Mexicans exhibit a small amount—roughly 4%—of African ancestry, a legacy of slavery sur de la frontera. Makes sense that a tiny amount would end up in my maternal lineage.

The Middle Eastern trace is probably a pulse from the Old World (Je Suis Charles Martel). It’s possible the M.E. trace could have sperm’ed or egg’ed its way into my lineage in Mexico, but since Lebanese and other Arabs didn’t start arriving in Mexico until the late 19th century, I doubt that’s the case.

Raw data. 

This is just the beginning. I’ve downloaded my 700,000 SNPs and indels and am looking forward to uploading the data to other tools to match against other databases. I’ll also be looking for ways these genetic testing algorithms might be valuable for analysis of large textual data sets.



Relinquishing Control

Responding to Allington et al.’s argument that the digital humanities are a handmaiden to neoliberalism and non-progressive scholarship, Juliana Spahr, Richard So, and Andrew Piper argue that DH and progressive scholarship are not in fact incommensurable. Without getting too deep into the many contours of the debate, I want to suggest in this post what I think may be the hidden crux of the argument (though I doubt the authors of either essay would agree with me).

Spahr et al.:

Ultimately what has most troubled us about Allington et al’s essay is its final line, which is its core assertion: they call on colleagues in the humanities to resist the rise of the digital humanities. They have carefully studied the field of the digital humanities and declare that it must be shut down; nothing good can come from it. We worry about this foreclosing of possibility. Other academic disciplines, such as sociology, have benefited greatly from the merging of critical and computational modes of analysis, particularly in overturning entrenched notions of gender or racial difference based on subjective bias. We find it is too early to reject in toto the use of digital methods for the humanities.

The urgent questions articulated by “Neoliberal Tools” thus present a rich opportunity to think about the field’s methodological potential. Questions about the over-representation of white men or the disproportionate lack of politically progressive scholarship in the digital humanities regard inequality and have a strong empirical basis. As such, they cannot be fully answered using the critical toolbox of current humanistic scholarship. These concerns are potentially measurable, and in measuring them, the full immensity of their impact becomes increasingly discernable, and thus, answerable. The informed and critical use of quantitative and computational analysis would thus be one way to add to the disciplinary critique that the authors themselves wish to see.

In these final paragraphs, the authors make the  move—an almost imperceptible move, but I think I can detect it—that anyone in the hard or social sciences must also make: they separate data from explanations for data. This, in my view, is what makes the ostensibly “progressive” or “activist” goals of some humanities scholarship somewhat incommensurable with computational work as such.

Questions about equality, the authors note, are questions that require large-scale measurement; they are not questions one can address adequately through close readings or selective anecdotes, which they describe as “the critical toolbox of current humanistic scholarship.” What they do not note—but I think it’s a point Allington et al. might eventually get around to making in a counter-argument—is that when you exchange a close, humanistic analysis for a data-driven one, then to a certain extent you relinquish control over the “correct” way to explain or theorize the resultant measurements. Indeed, our results, we now recognize, are far too easily “rationalized” with just-so stories that fit our pre-conceived notions (which isn’t to say that some just-so stories aren’t also true stories, or that some just-so stories aren’t truer than others).

“The data’s the data,” a biologist friend of mine once said. “It’s how you explain the data that gives rise to debates.”

Take Ted Underwood’s piece on gender representation in fiction, which Spahr et al. point to as an example of critical/computational scholarship. Underwood writes that between 1800 and 1989,  the words associated with male vs. female characters are volatile and in fact become more volatile in the twentieth century, making it more difficult for models to predict whether a set of words is being applied to a male or a female. “Gender,” he concludes, “is not at all the same thing in 1980 that it was in 1840.”

“Ah, gender is fluid,” we might conclude. Solid computational evidence for feminist theory. But then Underwood makes the data-grounded move, noting that cause(s) of the trend are open to interpretation and further data exploration:

The convergence of all these lines on the right side of the graph helps explain why our models find gender harder and harder to predict: many of the words you might use to predict it are becoming less common (or becoming more evenly balanced between men and women — the graphs we’ve presented here don’t yet distinguish those two sorts of change.)

Whether previously gendered terms converge toward both male and female characters, or whether gender-predicting terms simply disappear in fiction, could very much make a difference from the standpoint of explanation, especially critical or political explanation. E.g., one could claim, given the latter case (disappearance of gender-predicting terms), that what we see at work is the ignoring of gender rather than the fluid reframing of it, an effect, say, of feminism on fiction but not in any sense a confirmation of the essential fluidity of gender. However, it would also be perfectly feasible to use either explanation to forward a more critical or activist-minded thesis. It could go either way. And there’s the rub. When you’re doing computational work, you cannot also at the same time be explaining your results. Explanation is step two, and it’s a step people can take in different directions, politically friendly, politically unfriendly, or politically neutral.

And if the computational work you’re doing is interesting, you should at least sometimes find things that overturn your preconceived notions.  For example, Underwood notes that despite the general trend away from sharply-delineated gender descriptions, there are some important counter-trends.

On balance, that’s the prevailing trend. But there are also a few implicitly gendered forms of description that do increase. In particular, physical description becomes more important in fiction (Heuser and Le-Khac 2012).

And as writers spend more time describing their characters physically, some aspects of the body and dress also become more important as signifiers of gender. This isn’t a simple, monolithic process. There are parts of the body whose significance seems to peak at a certain date and then level off — like the masculine jaw, maybe peaking around 1950?

Other signifiers of masculinity — like the chest, and incidentally pockets — continue to become more and more important. For women, the “eyes” and “face” peak very markedly around 1890. But hair has rarely been more gendered (or bigger) than it was in the 1980s.

Rethinking things, perhaps we don’t see evidence that “gender is fluid” so much as evidence that gender remains sharply delineated, just along a different terminological axis than was previously the case. Or not. You could argue something else, too. Again, that’s the point.

As another example of what I’m talking about, we can look at Juliana Spahr’s and Stephanie Young’s work on the demographics of MFA and English PhD programs. It is an excellent piece, tied resolutely to statistics, but it ends this way:

We have ended this article many different ways, made various arguments about what is or what might be done. These arguments now seem either inadequate (reformist) or unrealistic (smash the MFA, the AWP, the private foundations, the state). At moments we struggled with our own structural positions even as these structures were created without our consent but to our advantage . . .

. . . we agree with McGurl when he argues that “[w]hat is needed now […] are studies that take the rise and spread of the creative writing program not as an occasion for praise or lamentation but as an established fact in need of historical interpretation: how, why, and to what end has the writing program reorganized U.S. literary production in the postwar period?” For us, for now, the best we can do is work to understand so that, when we create alternatives to the program, they do not amplify its hierarchies.

More research needed, in other words. Any previous calls to activism muted.

Spahr and Young do a wonderful job compiling relevant demographic information, but in so doing, they rightly recognize that interpreting the information (both historically and in the present moment) is another job altogether. The data are separated from their explanation. Spahr and Young are, I imagine, on the political left, but their data remain open to explanation from multiple political or apolitical perspectives.

From an apolitical perspective, I would want to explain some of their demographic data with simple demography. For example, they imply that 29% non-white representation in English PhD programs is not enough, but America is precisely 29% non-white and 71% white, so I don’t find that statistic problematic at all. I would also claim that this same demographic point partially ameliorates the 18% non-white representation in MFA programs, though obviously, a gap in representation remains. How to explain it, though? Their essay is (rightly) not ideological enough to foreclose on all but a single, left-facing window of possibility. This is a good thing. Recognizing the possibility of multiple explanations is what keeps a field of inquiry from becoming an ideological echo-chamber.

Spahr et al. also point to sociology as a field that uses computational methods to address critical, cultural questions. But again, addressing critical or cultural questions with computational methods is not at all the same thing as being critical, culturally progressive, or activist. Sociologists (and psychologists) have, I think, always recognized, if only quietly, that progressive or activist readings of their data are by no means the only readings. Steven Pinker and Jon Haidt, among others, are really pushing the point lately with their Heterodox Academy. It’s all a big debate, of course, but that’s the point.

In my view, good computational scholarship opens up debate and rarely points to One Single And Obvious And You’re Stupid If You Don’t Believe It conclusion. Sometimes it does, but that’s usually in the context of not-immediately-political content (e.g., whether or not Piraha possesses recursive syntax). But when you’re talking about large social or political explanations, I’ve never seen the explanation that doesn’t leave me thinking: Mm. Maybe. Interesting. I dunno. We’ll see.

I’m sure my skepticism comes across as conservatism to some. From my perspective as a scholar, however, I’m simply tentative about my own worldview. I’m therefore deeply suspicious of any scholar or study purporting to provide 100% support for any particular ideology or political platform. So I think it’s a good thing that a lot of DH work doesn’t do that. Indeed, I’m drawn most often to theories that piss off everyone across the political spectrum—e.g., Gregory Clark’s work—because my most deeply held prior is that the world as it is probably won’t conform very often to any particular ideology or politics. If anything, then, I’d like to see more DH work not confirming a single orthodoxy but challenging many orthodoxies all at once. Then I’ll be confident it’s doing something right.


Readability formulas

Readability scores were originally developed to assist primary and secondary educators in choosing texts appropriate for particular ages and grade levels. They were then picked up by industry and the military as tools to ensure that technical documentation written in-house was not overly difficult and could be understood by the general public or by soldiers without formal schooling.

There are many readability metrics. Nearly all of them calculate some combination of characters, syllables, words, and sentences; most perform the calculation on an entire text or a section of a text; a few (like the Lexile formula) compare individual texts to scores from a larger corpus of texts to predict a readability level.

The most popular readability formulas are the Flesch and Flesch-Kincaid.


Flesch readability formula: 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words)



Flesch-Kincaid grade level formula: 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59

The Flesch readability formula (last chapter in the link) results in a score corresponding to reading ease/difficulty. Counterintuitively, higher scores correspond to easier texts and lower scores to harder texts. The highest (easiest) possible score tops out around 120, but there is no lower bound to the score. (Wikipedia provides examples of sentences that would result in scores of -100 and -500.)

The Flesch-Kincaid grade level formula was produced for the Navy and results in a grade level score, which can be interpreted also as the number of years of education it would take to understand a text easily. The score has a lower bound in negative territory and no upper bound, though scores in the 13-20 range can be taken to indicate a college or graduate-level “grade.”
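To make the two formulas concrete, here is a minimal Python sketch using the standard published coefficients. The syllable counter is a crude assumption of mine (it just counts runs of vowels), and the sentence splitter only looks at periods, question marks, and exclamation points, so the numbers will differ somewhat from what a dedicated library returns. The function names are illustrative and are not taken from any particular package.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: count runs of vowels, with a floor of one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def text_counts(text):
    """Return (sentences, words, syllables) for a non-empty text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables

def flesch_reading_ease(text):
    """Higher score = easier text."""
    n_sent, n_words, n_syll = text_counts(text)
    return 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (n_syll / n_words)

def flesch_kincaid_grade(text):
    """Approximate U.S. school grade level."""
    n_sent, n_words, n_syll = text_counts(text)
    return 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59

easy = "The cat sat on the mat."
hard = ("Constitutional deliberations necessitate extraordinary "
        "institutional accommodation.")

for label, sample in (("easy", easy), ("hard", hard)):
    print(label,
          round(flesch_reading_ease(sample), 2),
          round(flesch_kincaid_grade(sample), 2))
```

Both formulas use the same two ratios, words per sentence and syllables per word, which is why the two scores tend to move in opposite directions on the same text.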

So why am I talking about readability scores?

One way to understand “distant reading” within the digital humanities is to say that it is all about adopting mathematical or statistical operations found in the social, natural/physical, or technical sciences and adapting them to the study of culturally relevant texts. E.g., Matthew Jockers’s use of the Fourier transform to control for text length; Ted Underwood’s use of cosine similarity to compare topic models; even topic models themselves, which come out of information retrieval (as do many of the methods used by distant readers); these examples could be multiplied.

Thus, I’m always on the lookout for new formulas and Python code that might be useful for studying literature and rhetoric.

Readability scores, it turns out, have sometimes been used to study presidential rhetoric—specifically, they have been used as proxies for the “intellectual” quality of a president’s speech-writing. Most notably, Elvin T. Lim’s The Anti-Intellectual Presidency applies the Flesch and Flesch-Kincaid formulas to inaugurals and States of the Union, discovering a marked decrease in the difficulty of these speeches from the 18th to the 21st centuries; he argues that this  decrease should be understood as part and parcel of a decreasing intellectualism in the White House more broadly.

Ten seconds of Googling turned up a nice little Python library—Textstat—that offers 6 readability formulas, including Flesch and Flesch-Kincaid.

I applied these two formulas to the 8 spoken/written SotU pairs I’ve discussed in previous posts. I also applied them to all spoken vs. all written States of the Union, copied chronologically into two master files. Here are the results (S = spoken, W = Written):


Flesch readability scores for States of the Union. Lower score = more difficult.


Flesch-Kincaid grade level scores for States of the Union.

The obvious trend uncovered is that written States of the Union are a bit more difficult to read than spoken ones. Contra Rule et al. (2015), this supports the thesis that medium matters when it comes to presidential address. Presidents simplify (or as Lim might say, they “dumb down”) their style when addressing the public directly; they write in a more elevated style when delivering written messages directly to Congress.

For the study of rhetoric, then, readability scores can be useful proxies for textual complexity. They’re certainly useful proxies for my current project studying presidential rhetoric. I imagine they could be useful to the study of literature, as well, particularly to the study of the literary public and literary economics. Does “reading difficulty” correspond with sales? with popular vs. unknown authors? with canonical vs. non-canonical texts? Which genres are more “difficult” and which ones “easier”?

Of course, like all mathematical formulas applied to culture, readability scores have obvious limitations.

For one, they were originally designed to gauge the readability of texts at the primary and secondary levels; even when adapted by the military and industry, they were meant to ensure that a text could be understood by people without college educations or even high school diplomas. Thus, as Begeny et al. (2013) have pointed out, these formulas tend to break down when applied to complex texts. Flesch-Kincaid grade level scores of 6 vs. 10 may be meaningful, but scores of, say, 19 vs. 25 would not be so straightforward to interpret.

Also, like most NLP algorithms, the formulas take as inputs things like characters, syllables, and sentences and are thus very sensitive to the vagaries of natural language and the influence of individual style. Steinbeck and Hemingway aren’t “easy” reads, but because both authors tend to write in short sentences and monosyllabic dialogue, their texts are often given scores indicating that 6th graders could read them, no problem. And authors who use a lot of semi-colons in place of periods may return a more difficult readability score than they deserve, since all of these algorithms equate long sentences with difficult reading. (However, I imagine this issue could be easily dealt with by marking semi-colons as sentence dividers.)
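The semi-colon fix suggested in that parenthesis might look something like the sketch below. It is only an illustration: the function name and the `split_on_semicolons` flag are hypothetical, and the sentence splitter is a bare regular expression, not what any readability library actually does.

```python
import re

def mean_sentence_length(text, split_on_semicolons=False):
    """Average words per sentence; optionally treat ';' as a sentence break."""
    dividers = r"[.!?;]+" if split_on_semicolons else r"[.!?]+"
    sentences = [s for s in re.split(dividers, text) if s.strip()]
    word_counts = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return sum(word_counts) / len(sentences)

sample = "He came in; he sat down; he said nothing. She left."

print(mean_sentence_length(sample))                            # periods only
print(mean_sentence_length(sample, split_on_semicolons=True))  # ';' also splits
```

Since sentence length feeds directly into both Flesch formulas, halving the average sentence length this way would make a semi-colon-heavy author score as noticeably "easier."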

All proxies have problems, but that’s never a reason not to use them. I’d be curious to know if literary scholars have already used readability scores in their studies. They’re relatively new to me, though, so I look forward to finding new uses for them.


Cosine similarity parameters: tf-idf or Boolean?

In a previous post, I used cosine similarity (a “vector space model”) to compare spoken vs. written States of the Union. In this post, I want to see whether and to what extent different metrics entered into the vectors—either a Boolean entry or a tf-idf score—change the results.

First, here’s a brief recap of cosine similarity: One way to quantify the similarity between texts is to turn them into term-document matrices, with each row representing one of the texts and each column representing a word that appears in either of the texts. (The matrices will be “sparse” because each text contains only some of the words across both texts.) With these matrices in hand, it is a straightforward mathematical operation to treat the rows as vectors in Euclidean space and calculate their cosine similarity with the Euclidean dot product formula, which returns a metric between 0 and 1, where 0 = no words shared and 1 = exact copies of the same text.

. . . But what exactly goes into the vectors in these matrices? Not words from the two texts under comparison, obviously, but numeric representations of the words. The problem is that there are different ways to represent words as numbers, and it’s never clear which is the best way. When it comes to vector space modeling, I have seen two common methods:

The Boolean method: if a word appears in a text, it is represented simply as a 1 in the vector; if a word does not appear in a text, it is represented as a 0.

The tf-idf method: if a word appears in a text, its term frequency-inverse document frequency is calculated, and that frequency score appears in the vector; if a word does not appear in a text, it is represented as a 0.
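As a rough illustration of the Boolean method, here is a self-contained Python sketch. It is not the script credited below; the tokenizer, function names, and sample sentences are all assumptions made for the example.

```python
import math
import re

def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def boolean_vectors(text_a, text_b):
    """Build 0/1 vectors over the combined vocabulary of the two texts."""
    words_a, words_b = set(tokenize(text_a)), set(tokenize(text_b))
    vocab = sorted(words_a | words_b)
    vec_a = [1 if w in words_a else 0 for w in vocab]
    vec_b = [1 if w in words_b else 0 for w in vocab]
    return vocab, vec_a, vec_b

def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty text shares nothing with anything
    return dot / (norm_u * norm_v)

spoken = "The state of the union is strong"
written = "The state of the economy is weak"

vocab, vec_s, vec_w = boolean_vectors(spoken, written)
print(vocab)
print(round(cosine_similarity(vec_s, vec_w), 3))
```

Note that in this version a word counts once no matter how often it appears, so the score depends only on which words the two texts share.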

In my previous post, I used this Python script (compliments to Dennis Muhlstein) which uses the Boolean method. Tf-idf scores control for document length, which is important sometimes, but I wasn’t sure if I wanted to ignore length when analyzing the States of the Union—after all, if a change in medium induces a change in a speech’s length, that’s a modification I’d like my metrics to take note of.

But how different would the results be if I had used tf-idf scores in the term-document matrices, that is, if I had controlled for document length when comparing written vs. spoken States of the Union?

Using Scikit-learn’s TfidfVectorizer and its cosine similarity function (part of the pairwise metrics module), I again calculated the cosine similarity of the written and spoken addresses, but this time using tf-idf scores in the vectors.
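For readers without scikit-learn at hand, the same idea can be sketched in plain Python. This is an approximation under stated assumptions: the smoothed idf formula, ln((1 + n) / (1 + df)) + 1, is what I understand scikit-learn's default to be, and the tokenizer and sample sentences are invented for illustration, so do not expect the numbers to match the library exactly.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def tfidf_vectors(docs):
    """Return one tf-idf vector per document over the shared vocabulary."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    doc_freq = {w: sum(1 for c in counts if w in c) for w in vocab}
    # Smoothed idf (assumed variant): words found in fewer documents weigh more.
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    return [[c[w] * idf[w] for w in vocab] for c in counts]

def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

spoken = "The state of the union is strong"
written = "The state of the economy is weak"

vec_s, vec_w = tfidf_vectors([spoken, written])
print(round(cosine_similarity(vec_s, vec_w), 3))

# A text compared with itself repeated twice: the vectors point the same way.
single, doubled = tfidf_vectors([spoken, spoken + " " + spoken])
print(round(cosine_similarity(single, doubled), 3))
```

The second comparison shows the length point in miniature: doubling a text doubles every count but leaves the direction of its vector, and so its cosine similarity, unchanged.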

The results of both methods—Boolean and tf-idf—are graphed below.


I graphed the (blue) tf-idf measurements first, in decreasing order, beginning with the most similar pair (Nixon’s 1973 written/spoken addresses) and ending with the most dissimilar pair (Eisenhower’s 1956 addresses). Then I graphed the Boolean measurements following the same order. I ended each line with a comparison of all spoken and all written States of the Union (1790 – 2015) copied chronologically into two master files.

Both methods capture the same general trend, though with slightly different numbers attached to it. In a few cases, these discrepancies seem major: With tf-idf scores, Nixon’s 1973 addresses returned a cosine similarity metric of 0.83; with Boolean entries, the same addresses returned a cosine similarity metric of 0.62. And when comparing all written/spoken addresses, the tf-idf method returned a similarity metric of 0.75; the Boolean method returned a metric of only 0.55.

So, even though both methods capture the same general trend, tf-idf scores produce results suggesting that the spoken/written pairs are more similar to each other than do the Boolean entries. These divergent results might warrant slightly different analyses and conclusions—not wildly different, of course, but different enough to matter. So which results most accurately reflect the textual reality?

Well, that depends on what kind of textual reality we’re trying to model. Controlling for length obviously makes the texts appear more similar, so the right question to ask is whether or not we think length is a disposable feature, a feature producing more noise than signal. I’m inclined to think length is important when comparing written vs. spoken States of the Union, so I’d be inclined to use the Boolean results.

Either way, my habit at the moment is to make parameter adjustment part of the fun of data analysis, rather than relying on the default parameters or on whatever parameters all the really smart people tend to use. The smart people aren’t always pursuing the same questions that I’m pursuing as a humanist who studies rhetoric.


Another issue raised by this method comparison is the nature of the cosine similarity metric itself. 0 = no words shared, 1 = exact copies of the same text, but that leaves a hell of a lot of middle ground. What can I say, ultimately, and from a humanist perspective, about the fact that Nixon’s 1973 addresses have a cosine similarity of 0.83 while Eisenhower’s 1956 addresses have a cosine similarity of 0.48?

A few days ago I found and subsequently lost and now cannot re-find a Quora thread discussing common-sense methods for interpreting cosine similarity scores, and all the answers recommended using benchmarks: finding texts from the same or a similar genre as the texts under comparison that are commonly accepted to be exceedingly different or exceedingly similar (asking a small group of readers to come up with these judgments can be a good idea here). So, for example, if using this method on 19th century English novels, a good place to start would be to measure, say, Moby Dick and Pride and Prejudice, two novels that a priori we can be absolutely sure represent wildly different specimens from a semantic and stylistic standpoint.

And indeed, the cosine similarity of Melville’s and Austen’s novels is only 0.24. There’s a dissimilarity benchmark set. At the similarity end, we might compute the cosine similarity of, say, Pride and Prejudice and Sense and Sensibility.

Given that my interest in the State of the Union corpus has more to do with mode of delivery than individual presidential style, I’m not sure how to go about setting benchmarks for understanding (as a humanist) my cosine similarity results—I’m hesitant to use the “similarity cline” apparent in the graph above because that cline is exactly what I’m trying to understand.


Structuralist Methods in a Post-Structuralist Humanities

The topic of this conference (going on now!) at Utrecht University raises an issue similar to the one raised in my article at LSE’s Impact Blog: DH’ists have been brilliant at mining data but not always so brilliant at pooling data to address the traditional questions and theories that interest humanists. Here’s the conference description (it focuses specifically on DH and history):

Across Europe, there has been much focus on digitizing historical collections and on developing digital tools to take advantage of those collections. What has been lacking, however, is a discussion of how the research results provided by such tools should be used as a part of historical research projects. Although many developers have solicited input from researchers, discussion between historians has been thus far limited.

The workshop seeks to explore how results of digital research should be used in historical research and to address questions about the validity of digitally mined evidence and its interpretation.

And here’s what I said in my Impact Blog article, using as an example my own personal hero’s research in literary geography:

[Digital humanists] certainly re-purpose and evoke one another’s methods, but to date, I have not seen many papers citing, for example, Moretti’s actual maps to generate an argument not about methods but about what the maps might mean. Just because Moretti generated these geographical data does not mean he has sole ownership over their implications or their usefulness in other contexts.

I realize now that the problem is still one of method—or, more precisely, of method incompatibility. And the conference statement above gets to the heart of it.

Mining results with quantitative techniques is ultimately just data gathering; the next and more important step is to build theories and answer questions with that data. The problem is, in the humanities, that moving from data gathering to theory building forces the researcher to move between two seemingly incommensurable ways of working. Quantitative data mining is based on strict structuralist principles, requiring categorization and sometimes inflexible ontologies; humanistic theories about history or language, on the other hand, are almost always post-structuralist in their orientation. Even if we’re not talking Foucault or Derrida, the tendency in the humanities is to build theories that reject empirical readings of the world that rely on strict categorization. The 21st century humanistic move par excellence is to uncover the influence of “socially constructed” categories on one’s worldview (or one’s experimental results).

On Twitter, Melvin Wevers brings up the possibility of a “post-structuralist corpus linguistics.” To which James Baker and I replied that that might be a contradiction in terms. To my knowledge, there is no corpus project in existence that could be said to enact post-structuralist principles in any meaningful way. Such a project would require a complete overhaul of corpus technology from the ground up.

So where does that leave the digital humanities when it comes to the sorts of questions that got most of us interested in the humanities in the first place? Is DH condemned forever to gather interesting data without ever building (or challenging) theories from that data? Is it too much of an unnatural vivisection to insert structural, quantitative methods into a post-structuralist humanities?

James Baker throws an historical light on the question. When I said that post-structuralism and corpus linguistics are fundamentally incommensurable, he replied with the following point:

And he suggested that in his own work, he tries to follow this historical development:

Structuralism and post-structuralism exist (or should exist) in dialectical tension. The latter is a real historical response to the former. It makes sense, then, to enact this tension in DH research. Start out as a positivist, end as a critical theorist, then go back around in a recursive process. This is probably what anyone working with DH methods does already. I think Baker’s point is that my “problem” posed above (structuralist methods in a post-structuralist humanities) isn’t so much a problem as a tension we need to be comfortable living with.

Not all humanistic questions or theories can be meaningfully tackled with structuralist methods, but some can. Perhaps a first step toward enacting the structuralist/post-structuralist dialectical tension in research is to discuss principles regarding which topics are or are not “fair game” for DH methods. Another step is going to be for skeptical peer reviewers not to balk at structuralist methods by subtly trying to remove them with calls for more “nuance.” Searching out the nuances of an argument—refining it—is the job of multiple researchers across years of coordinated effort. Knee-jerk post-structuralist critiques (or requests for an author to put them in her article) are unhelpful when a researcher has consciously chosen to utilize structuralist methods.


This is in response to Collin Brooke, who asked for some lists of 5.

5 Books On My Desk

  1. The Bourgeois, Franco Moretti.
  2. In the Footsteps of Genghis Khan, John DeFrancis.
  3. Warriors of the Cloisters: The Central Asian Origins of Science in the Medieval World, Christopher Beckwith.
  4. Medieval Rhetoric: A Select Bibliography, James J. Murphy.
  5. The Invaders, Pat Shipman.

5 Most Played Songs in my iTunes

  1. All Night Long, Lionel Richie
  2. We Are All We Need, Above and Beyond
  3. In My Memory, Tiesto
  4. Two Tickets to Paradise, Eddie Money
  5. We Built This City, Starship

5 Toppings That I Just Put On My Frozen Yogurt

  1. Peanut butter cups
  2. Chocolate chips
  3. M&Ms
  4. Cookie dough
  5. Whipped cream

5 Alcoholic Beverages In The Kitchen

  1. Bud Light Lime
  2. Jose Cuervo Silver
  3. Triple sec
  4. E&J Brandy
  5. Fireball

5 TV Shows in My Netflix Instant Queue

  1. Mad Men
  2. Human Planet
  3. Don’t Trust the B---- in Apartment 23
  4. Wild India
  5. Blue Planet

Some Quick Text Mining of the 2015 CCCC Program

During CCCC last week, Freddie deBoer made a couple of comments about the conference: first, that there weren’t as many panels on the actual work of teaching writing as there were panels on sexier topics, like [insert stereotypical humanities topic here]; and second, that not much empirical research was being presented at the conference.

Testing these claims isn’t easy, but as a first stab, here’s a list of the most frequent unigrams and bigrams in the conference’s full list of presentation titles, as found in the official program. Make of these lists what you will. It’s pretty obvious to me that the conference wasn’t bursting at the seams with quantitative data. Sure, research appears at the head of the distribution, but I’ll leave it to you to concordance the word and figure out how often it denotes empirical research into writers while writing.
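For anyone who wants to rebuild frequency lists like these, here is a minimal sketch of one way to count unigrams and bigrams in a set of titles. It is not necessarily how the lists above were produced. The sample titles are hypothetical stand-ins, the function names are mine, and the tokenizer is deliberately crude (lowercase alphabetic words, no stopword filtering).

```python
import re
from collections import Counter

# Hypothetical stand-in titles; the real input would be the
# presentation titles extracted from the program.
titles = [
    "Teaching Writing with Big Data",
    "Big Data and the Writing Classroom",
    "Case Studies in Writing Research",
]

def tokenize(title):
    """Lowercase a title and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", title.lower())

def ngram_counts(titles, n):
    """Count n-grams inside each title, never across title boundaries."""
    counts = Counter()
    for title in titles:
        tokens = tokenize(title)
        # zip over n staggered copies of the token list yields n-grams as tuples
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

unigrams = ngram_counts(titles, 1)
bigrams = ngram_counts(titles, 2)

print(unigrams.most_common(5))
print(bigrams.most_common(5))
```

Counting within each title, rather than over one long stream of text, keeps the last word of one title from forming a spurious bigram with the first word of the next.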

Then again, big data was a relatively popular term this year. It was used in titles more often than case studies, though case studies was used more often than digital humanities.

To Freddie’s point, the word empirical appears only 11 times in the CCCC program; the word essay appears only 16 times. Is it therefore fair to say there weren’t many empirical studies on essay writing presented this year? Maybe. Maybe not.
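Raw counts can’t settle that question; each hit has to be read in context. Here is a minimal keyword-in-context (KWIC) sketch for doing so. The titles are hypothetical stand-ins rather than real program entries, the `kwic` helper is my own illustration, and it matches only the exact word form (so a search for essay would skip essays).

```python
import re

# Hypothetical stand-in titles, not taken from the actual program.
titles = [
    "An Empirical Study of Revision in First-Year Essays",
    "Rethinking the Personal Essay",
    "Empirical Methods and Their Discontents",
]

def kwic(titles, keyword, width=3):
    """Return (left, hit, right) for each occurrence of keyword,
    with up to `width` words of context on either side."""
    rows = []
    for title in titles:
        tokens = re.findall(r"[A-Za-z][A-Za-z'-]*", title)
        for i, token in enumerate(tokens):
            if token.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                rows.append((left, token, right))
    return rows

# Print the hits aligned on the keyword, concordance style.
for left, hit, right in kwic(titles, "empirical"):
    print(f"{left:>30} | {hit} | {right}")
```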


One way to get a flavor for the contexts and connotations of individual words and bigrams is of course to create a text network. I’ve begun to think of text networks as visual concordances.

Here is a text network of the tokens writing, write, writer, writers, writing_courses, classroom, and classrooms in the CCCC program. One thing to notice here is that these words are all semantically related, but in the panel and presentation titles, they exist in clusters of relatively unrelated words. I had expected to discover a messy, overlapping network with these terms, but they’re rather distinct, as judged by the company they keep in the CCCC program. Even the singular and plural forms of the same noun (e.g., from classroom to classrooms, writer to writers) form distinct clusters.


In relation to Freddie’s point, this network demonstrates that words or bigrams that are prima facie good proxies for “teaching writing” often do lead us to presentations that are pedagogical in nature. However, just as often, they lead us to presentations that are only tangentially or not at all related to the teaching of writing and to the empirical study of writers while writing.

Thus, writer forms a cluster with FYC, student, and reader but also with identity, ownership, and virtual. The same thing occurs with the other terms, though writing by far occurs alongside the most diverse range of lexical items.
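For readers curious how a network like this can be assembled, here is a minimal sketch that links each seed term to the other words appearing in the same title, then checks whether two seeds share any neighbors. It is an illustration under my own assumptions, not the procedure behind the figures: the titles are hypothetical, the stopword list is a tiny placeholder, and "co-occurs in the same title" is just one possible definition of an edge.

```python
import re
from collections import Counter

# Hypothetical stand-in titles.
titles = [
    "The Student Writer in FYC",
    "Writer Identity and Ownership",
    "Writers and Their Readers",
]
seeds = {"writer", "writers"}
stopwords = {"the", "in", "and", "their"}  # tiny illustrative list

def cooccurrence_edges(titles, seeds, stopwords):
    """Count (seed, neighbor) pairs for words that share a title with a seed."""
    edges = Counter()
    for title in titles:
        tokens = {t for t in re.findall(r"[a-z_]+", title.lower())
                  if t not in stopwords}
        for seed in tokens & seeds:
            for other in tokens - {seed}:
                edges[(seed, other)] += 1
    return edges

edges = cooccurrence_edges(titles, seeds, stopwords)

# Collapse the weighted edge list into one neighbor set per seed.
neighbors = {}
for (seed, other), weight in edges.items():
    neighbors.setdefault(seed, set()).add(other)

for seed in sorted(neighbors):
    print(seed, "->", sorted(neighbors[seed]))

# An empty intersection means the two seeds keep entirely separate company.
shared = neighbors["writer"] & neighbors["writers"]
print("shared neighbors:", sorted(shared))
```

The edge counts can be exported as a weighted edge list for whatever graph tool you prefer; the neighbor-set comparison is just a quick way to see how distinct two clusters are without drawing anything.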





This is about as much work as I’m interested in doing on the CCCC program for now. In my last post, I put a download link for a .doc version of the program, for anyone interested in doing a more thorough analysis, whether to test Freddie’s claims or to test your own ideas about the field’s zeitgeist.

However, it’s always important to keep in mind that a conference program might tell us more about the influence of conference themes than about the field itself.

ADDED: Here is a list of all names listed at the end of the CCCC program (CCCCProgramNames). The problem is that it’s a list of the FIRST and LAST names, with each given its own entry. Anyone so inclined can go through this list and delete the last names, which will leave a file that can be run through a gender recognition algorithm to see what the gender split of CCCC presenters was.
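Deleting the last names by hand isn’t strictly necessary if the entries alternate cleanly. The sketch below ASSUMES the file has one entry per line in strict first, last, first, last order; I haven’t verified that, and middle names, initials, or multi-word surnames would break it. The sample entries and the `keep_first_names` helper are hypothetical.

```python
def keep_first_names(entries):
    """Drop blank entries, then keep entries 0, 2, 4, ...
    ASSUMPTION: entries strictly alternate first name, last name."""
    cleaned = [e.strip() for e in entries if e.strip()]
    return cleaned[0::2]

# Hypothetical sample; the real input would be the lines of the names file.
entries = ["Ada", "Lovelace", "", "Alan", "Turing", "Grace", "Hopper"]

first_names = keep_first_names(entries)
print(first_names)
```

Spot-check the output against the original list before trusting it: a single entry that breaks the alternation shifts every name after it.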