“Re-purposing Data” in the Digital Humanities

Histories of science and technology provide many examples of accidental discovery. Researchers go looking for one thing and find another. Or, more often, they look for one thing, find something else but don’t realize it until someone points it out in a completely different context. The serendipitous “Eureka!” is the most exciting of all.

Take the microwave oven. Its inventor, Percy Spencer, was not trying to discover a quick, flameless way to cook food. He was working on a magnetron, a vacuum tube designed to produce short-wavelength electromagnetic waves for radar. One day, he came to work with a chocolate bar in his pocket. The waves melted the candy bar. Intrigued, Spencer tried to pop popcorn with the magnetron. That worked, too. So Spencer constructed a metal box, then fed microwaves and food into it. Voilà. A radar tech discovers that a property of the magnetron can be repurposed, from creating short wavelengths for radar to creating hot dogs in 30 seconds.

Another example is the discovery of cosmic microwave background radiation, the defining piece of evidence in support of the Big Bang Theory. Wikipedia tells the story well:

By the middle of the 20th century, cosmologists had developed two different theories to explain the creation of the universe. Some supported the steady-state theory, which states that the universe has always existed and will continue to survive without noticeable change. Others believed in the Big Bang theory, which states that the universe was created in a massive explosion-like event billions of years ago (later to be determined as 13.8 billion).

Working at Bell Labs in Holmdel, New Jersey, in 1964, Arno Penzias and Robert Wilson were experimenting with a supersensitive, 6 meter (20 ft) horn antenna originally built to detect radio waves bounced off Echo balloon satellites. To measure these faint radio waves, they had to eliminate all recognizable interference from their receiver. They removed the effects of radar and radio broadcasting, and suppressed interference from the heat in the receiver itself by cooling it with liquid helium to −269 °C, only 4 K above absolute zero.

When Penzias and Wilson reduced their data they found a low, steady, mysterious noise that persisted in their receiver. This residual noise was 100 times more intense than they had expected, was evenly spread over the sky, and was present day and night. They were certain that the radiation they detected on a wavelength of 7.35 centimeters did not come from the Earth, the Sun, or our galaxy. After thoroughly checking their equipment, removing some pigeons nesting in the antenna and cleaning out the accumulated droppings, the noise remained. Both concluded that this noise was coming from outside our own galaxy—although they were not aware of any radio source that would account for it.

At that same time, Robert H. Dicke, Jim Peebles, and David Wilkinson, astrophysicists at Princeton University just 60 km (37 mi) away, were preparing to search for microwave radiation in this region of the spectrum. Dicke and his colleagues reasoned that the Big Bang must have scattered not only the matter that condensed into galaxies but also must have released a tremendous blast of radiation. With the proper instrumentation, this radiation should be detectable, albeit as microwaves, due to a massive redshift.

When a friend (Bernard F. Burke, Prof. of Physics at MIT) told Penzias about a preprint paper he had seen by Jim Peebles on the possibility of finding radiation left over from an explosion that filled the universe at the beginning of its existence, Penzias and Wilson began to realize the significance of their discovery. The characteristics of the radiation detected by Penzias and Wilson fit exactly the radiation predicted by Robert H. Dicke and his colleagues at Princeton University. Penzias called Dicke at Princeton, who immediately sent him a copy of the still-unpublished Peebles paper. Penzias read the paper and called Dicke again and invited him to Bell Labs to look at the Horn Antenna and listen to the background noise. Robert Dicke, P. J. E. Peebles, P. G. Roll and D. T. Wilkinson interpreted this radiation as a signature of the Big Bang.

Penzias and Wilson were looking for one thing for Bell Labs, found something else, thought it might have been pigeon shit, then realized they’d stumbled upon evidence directly relevant to another research project.

In the sciences, data are data, and once presented, they are there for the taking. “Re-purposing data” means using data compiled for one project in your own project. In some sense, all scholars do this. Bibliographies and lit reviews signal that a piece of scholarship has built on existing scholarship. In the humanities, however, scholars are accustomed to building on whole arguments, not individual points of data. If Dicke, Peebles, and Wilkinson had been humanists, they would have asked, “How does the practice of detecting faint radio waves bounced off Echo balloon satellites relate to our work on cosmic background radiation?” That is not necessarily the wrong question to ask, and the connection might have been forged eventually. But given that everyone involved was a scientist, no one posed the question that way, and I imagine it was much more natural for Penzias and Wilson’s data to be removed from their context and placed into another. Humanists, on the other hand, are not conditioned to chop up another scholar’s argument, isolate a detail, remove it, and put it into an unrelated argument. This seems like bad form. Sources, their contexts, and the nuances of their arguments are introduced in total; this is vital if you are going to use a source properly in the humanities.

Digital humanists construct arguments just like any other humanists, but rather than deploying what Rebecca Moore Howard calls “ethos-based” argumentation, they typically traffic in mined and researched data: the locations of beginnings and endings in Jane Austen novels, citation counts in academic journals, metadata relating to the genders and nationalities of authors. These data always exist in the context of a specific argument made by the researcher who has compiled them, but data are more portable than ethos-based arguments, in which any one strand of thought relies on all the others. No such reliance exists in data-based argumentation. In other words, an antimetabole: a data-based argument relies on the data, but the data do not rely on the argument.

A hypothetical example and a real one:

In “Style, Inc: Reflections on 7,000 Titles,” Moretti compiles a very particular set of data: the word counts of British novel titles between 1740 and 1850. He provides several graphs to document an obvious trend, that novel titles got drastically shorter throughout the 18th and 19th centuries. From these data, Moretti makes, as he usually does, a compelling argument about the literary marketplace and its effect on literary form:

As the number of new novels kept increasing, each of them had inevitably a much smaller ‘window’ of visibility on the market, and it became vital for a title to catch quickly and effectively the eye of the public. [Summary titles] were not good at that. They were good at describing a book in isolation: but when it came to standing out in a crowded marketplace, short titles were better—much easier to remember, to begin with. (187-88)

Moretti’s argument relies on his analysis of data about novel titles; his argument would be weaker (non-existent?) without the data. But now that these data have been compiled, are they useful only in the context of Moretti’s argument? Of course not. Let’s say I’m a book historian writing my dissertation on changing book and paper sizes between 1500 and 1900. Let’s say I’ve discovered (hypothetically; it’s probably not true) that smaller book sizes, duodecimos and even sextodecimos, proliferated between 1810 and 1900, relative to earlier decades in the 18th century. Now let’s say I find Moretti’s article on shortened book titles during the same period. Hmm, I think. Interesting. Never mind that “Style, Inc.” is focused on literary form, never mind that I’m writing about the materials of book history, never mind that I’m not interested in Moretti’s argument about literary form per se. Moretti’s data nevertheless might generate an interesting discussion. Maybe I’ll look at titles more closely. Maybe I can even get a whole chapter out of this: “Titles and Title Pages in relation to Book Sizes.” A serendipitous connection. A scholar in book history and a literary scholar making different but in no way opposed arguments from the same data.
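For readers who think better in code, here is a minimal sketch of what that portability might look like. Everything in it is an assumption made for illustration: the four records, their titles, years, and formats are invented, and the helper names (`title_length`, `mean_length_by`) are hypothetical. It is not Moretti's dataset or method. The point is only that one table of title data can be grouped by period for a question about literary form, or by book format for a question about book history, without either analysis depending on the other.

```python
from collections import defaultdict
from statistics import mean

# Invented toy records for illustration only; not drawn from Moretti's data.
records = [
    {"year": 1750, "title": "The History of a Young Lady of Quality Written by Herself", "format": "octavo"},
    {"year": 1760, "title": "Memoirs of a Gentleman Lately Returned from the Indies", "format": "octavo"},
    {"year": 1820, "title": "The Orphan", "format": "duodecimo"},
    {"year": 1830, "title": "Lost Hopes", "format": "duodecimo"},
]


def title_length(title):
    """Count the whitespace-separated words in a title."""
    return len(title.split())


def mean_length_by(records, key):
    """Group records with the `key` function and average title length per group."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(title_length(rec["title"]))
    return {group: mean(lengths) for group, lengths in groups.items()}


# Question 1 (literary form): do titles shorten over time?
by_half_century = mean_length_by(records, lambda r: (r["year"] // 50) * 50)

# Question 2 (book history): do shorter titles travel with smaller formats?
by_format = mean_length_by(records, lambda r: r["format"])

print(by_half_century)
print(by_format)
```

The same `records` list feeds both questions; only the grouping key changes. That is the antimetabole in practice: each argument needs the data, but the data need neither argument.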

Real example: I’ve just finished a paper on the construction of disciplinary boundaries in academic journals. In it, I use data from Derek Mueller’s article which counts citations in the journal College Composition and Communication. I also compile citations from other journals, focusing on citations in abstracts. But the argument I make is not quite the same as Mueller’s. In fact, I analyze my data on citations in a way that hopefully shines a new light on Mueller’s data. Both Mueller and I discover (unsurprisingly) that citations in articles and abstracts form a power law distribution. Mueller argues that the “long tail” of the citation distribution implies a “loose amalgamation” of disparate scholarly interests and that the head of the distribution represents the small canon uniting the otherwise disparate interests. I argue, however, that when we look at the entire distribution thematically, we discover that each unique citation added to the distribution—whether it ends up in the head or the long tail—may in fact be thematically connected to many other citations, whether they also be in the head or the long tail. (For example, Plato is in the head of one journal’s citation distribution, and Aristophanes is in the long tail, but a scholar’s addition of Aristophanes to the long tail does not imply scholarly divergence from the many additions of Plato. Both citations suggest unity insofar as both signal a single scholarly focus on rhetorical history.)
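A rough sketch may help make the difference between the two readings concrete. The citation lists and thematic labels below are invented for illustration; they are not Mueller's data or mine, and the function names are hypothetical. The sketch ranks citations by frequency, splits the ranking into a head and a long tail, and then re-reads the same counts by theme.

```python
from collections import Counter

# Invented toy data: each inner list holds the authors cited by one article.
articles = [
    ["Plato", "Aristotle", "Burke"],
    ["Plato", "Burke"],
    ["Plato", "Aristophanes"],
    ["Plato", "Aristotle"],
    ["Burke", "Quintilian"],
]

# Hypothetical thematic labels, assigned by hand for this example.
themes = {
    "Plato": "rhetorical history",
    "Aristotle": "rhetorical history",
    "Aristophanes": "rhetorical history",
    "Quintilian": "rhetorical history",
    "Burke": "modern rhetorical theory",
}


def rank_citations(articles):
    """Return (author, count) pairs sorted from most to least cited."""
    counts = Counter(author for cited in articles for author in cited)
    return counts.most_common()


def split_head_tail(ranked, head_size):
    """Split a ranked list into the head (top entries) and the long tail."""
    return ranked[:head_size], ranked[head_size:]


def theme_totals(ranked, themes):
    """Sum citation counts by theme, regardless of head or tail position."""
    totals = Counter()
    for author, count in ranked:
        totals[themes[author]] += count
    return totals


ranked = rank_citations(articles)
head, tail = split_head_tail(ranked, head_size=2)
print("head:", head)
print("tail:", tail)
print("by theme:", dict(theme_totals(ranked, themes)))
```

In this toy case Aristophanes sits in the tail and Plato in the head, yet both add to the same thematic total. The head/tail split and the thematic grouping are two readings of one set of counts, which is all the re-purposing amounts to.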

I re-purpose Mueller’s data but not his argument. Honestly, in my paper, I don’t spend much time at all working through the nuances of Mueller’s paper because they’re not important to mine. His data are important—they and the methods he used to compile them are the focus of my argument, which moves in a slightly different direction than Mueller’s.

To reiterate: data in the digital humanities beg to be re-purposed, taken from one context and transferred to another. All arguments rely on data, but the same data may always be useful to another argument. At the end of my paper, I write: “I have used these corpora of article abstracts to analyze disciplinary identity, but this same group of texts can be mined with other (or the same) methods to approach other research questions.” That’s the point. Are digital humanists doing this? They certainly re-purpose and evoke one another’s methods, but to date, I have not seen any papers citing, for example, Moretti’s actual maps to generate an argument not about methods but about what the maps might mean. Just because Moretti generated these geographical data does not mean he has sole ownership over their implications or their usefulness in other contexts.

There’s a limit to all this, of course. Pop-science journalism, at its worst, demonstrates the hazards of decontextualizing a data-point from a larger study and drawing all sorts of wild conclusions from it, conclusions contradicted by the context and methods of the study from which the data-point was taken. It is still necessary to analyze critically the research from which data are taken and, more importantly, the methods used to obtain them. However, if we are confident that the methods were sound and that our own argument does not contradict or over-simplify something in the original research, we can be equally confident in re-purposing the data for our own ends.