Ranking Native American language health

I recently finished reading Ellen Cushman’s The Cherokee Syllabary, an excellent book on the history and spread of the writing system developed by Sequoyah for the Cherokee tribe. Cushman does a thorough job explaining how the syllabary works as a syllabary, rather than describing it in alphabetic terms. She argues that to explain a syllabary in terms of one-to-one sound-grapheme correspondence (which is often the tack in linguistic work) is already to analyze it in alphabetic terms.

One of Cushman’s central projects in the book is to demonstrate how the Cherokee syllabary—both its structure and graphic representation—grew from Cherokee culture. It was not, she argues, a simple borrowing and re-application of the Roman alphabetic script. Most scholars would disagree with her, including Henry Rogers in Writing Systems: A Linguistic Approach and Steven Roger Fischer in A History of Writing. Fischer claims that “using an English spelling book, [Sequoyah] arbitrarily appointed letters of the alphabet” to correspond with units of sound in Cherokee (287). Cushman counters this claim by pointing out that linguists only make it after looking at the printed form of Cherokee, which, by necessity, remediated Sequoyah’s original syllabary into a more Latinate form. Cushman provides us with pictures of the original syllabary, as well as a new Unicode font that she believes more adequately represents the original style:

Much of Cushman’s book is devoted to showing the connection between Cherokee culture and the syllabary, a connection which obviates the need to assume some sort of alphabetic borrowing.

I’m not at all convinced by this main argument (still lots of Latinate forms up there), but I was quite interested, after reading the book, in another point Cushman makes about what it means to be Native American, both historically and contemporarily. She posits “four pillars of Native peoplehood: language, history, religion, and place” (6). I would argue that language is the most powerful of the four, but Cushman merely claims that the loss of the Cherokee language would “spell the ruin of an integral part of Cherokee identity.”

No doubt it would. And this got me thinking about native language health in general. As regards Cherokee specifically, Cushman writes that “while the Cherokees are one of the largest tribes in the United States, the Cherokee Nation estimates that only a few thousand speak, read, and write the Cherokee language” (6). I checked this statistic and found it to be correct but misleading. Perhaps only a few thousand Cherokees “speak, read, and write” Cherokee, but 16,000 speak the language.

So what about other native languages? Using Ethnologue and the World Atlas of Language Structures, I ranked all native American languages (and a few Canadian languages) by their ‘linguistic health’, measured purely as number of speakers. Here’s a bar chart of native languages with more than 100 speakers. (Click to enlarge.) Already, you can notice the seriously skewed curve that I’ll discuss in a moment . . .

Now, no native language in America (or Canada) is ‘healthy’ compared to English, Spanish, Mandarin, Hindi, or the world’s other dominant languages. Nearly all native American languages are endangered, severely endangered, or extinct. Only one—Navajo—escapes the ‘endangered’ list, but even then, Navajo is lately considered ‘vulnerable’ because the youngest generation is switching to English.

Within this continuum of endangered native languages, however, there exists a highly skewed continuum of linguistic health. There are approx. 115 living languages in America, but only 35 possess more than 1,000 speakers. Only 9 possess more than 10,000 speakers. And only 3 possess more than 50,000 speakers. In other words, the great bulk of living native American languages are in bad shape, and will likely go extinct within the next generation, joining the 41 native languages that already have gone extinct. Here’s the ranking of native languages with fewer than 100 speakers:
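The threshold counts above can be reproduced from any table of speaker numbers. Here’s a minimal sketch of the bucketing step, using rough, partly invented figures rather than the real dataset:

```python
# Toy sketch of the ranking: bucket languages by speaker counts.
# The numbers below are rough placeholders, NOT the real Ethnologue data.
speakers = {
    "Navajo": 170000, "Cree": 96000, "Ojibwe": 56000,
    "Dakota": 18000, "Cherokee": 16000, "Yupik": 16000,
    "Choctaw": 10000, "Zuni": 9500, "Hopi": 6800,
    "Sierra Miwok": 30, "Plains Miwok": 5,
}

def count_above(counts, threshold):
    """Number of languages with more than `threshold` speakers."""
    return sum(1 for n in counts.values() if n > threshold)

# Rank from healthiest to least healthy, then report the skew.
ranked = sorted(speakers.items(), key=lambda kv: kv[1], reverse=True)
for t in (1000, 10000, 50000):
    print(f"more than {t} speakers: {count_above(speakers, t)} languages")
```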

And yet what interests me about this data is not the obvious point about language loss in our post-colonial present. Language loss is the inevitable outcome in the wake of conquest; Old English itself was lost when the Norman French invaded Britain. Rather, what interests me is that, extinction and severe endangerment being the rule, several languages have managed to become glaring exceptions to the rule. Why?

According to my list, there are approximately 454,515 native language speakers in America—and parts of Canada, since I’ve included Cree and Ojibwe, Canada’s healthiest native languages, in my list (see the end of this post for more methodological details). At the start of the colonial era, there were somewhere between 2 and 7 million natives living in what is now the U.S. and Canada, with most of that population inhabiting the U.S. Splitting the difference, we can say there were 4.5 million native language speakers pre-conquest but only 454,515 today. That’s a nearly 90% reduction in native language speakers over the course of 500 years.

(Note: this is not the same as a reduction in population. There are currently 2.9 million native Americans in the U.S., which, depending on your source, represents anywhere from a net gain in population between the 15th and 21st centuries to a loss of around 50-60% of the total native population. The comparatively drastic loss in the number of native language speakers, however, results from the fact that most native Americans have, both recently and historically, switched to English.)

Speaking of languages, then, not population, it seems as though total annihilation is the most probable outcome for a language after conquest. It seems almost inevitable that a conquered population’s language will eventually become the language of the conqueror. (This is why only 100,000 people speak Irish in Ireland, and why no one speaks an un-Romanized version of English.)

Thus, it’s not surprising that most native languages possess fewer than 1,000 speakers, or that more than half only have between 1 and 100 speakers—i.e., it’s not surprising that more than half of native American languages are practically extinct. If we ignore the nine ‘healthiest’ native languages (the outliers with more than 10,000 speakers), then the total reduction in native language speakers between pre-colonial times and today rapidly approaches 100%.

Which returns us to the interesting thing about this data: the existence of these (comparatively) healthy native American languages. The nine healthiest languages have a total of 368,259 speakers, which translates to 81% of all native language speakers across all tribal languages; and yet these nine languages comprise only 7% of all native languages. In other words, 81% of native language speakers in America and parts of Canada speak only 7% of the existing native languages (less than 4% of all native languages, if we factor extinct languages and all Canadian languages into the equation).
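The arithmetic behind these percentages (and the ~90% reduction figure above) is easy to check:

```python
# Checking the arithmetic behind the figures quoted in this post.
total_speakers = 454_515     # all counted native-language speakers
top9_speakers = 368_259      # speakers of the nine healthiest languages
living_languages = 115       # living native languages in the U.S. count
pre_conquest = 4_500_000     # the split-the-difference estimate above

share_of_speakers = top9_speakers / total_speakers   # ~0.81
share_of_languages = 9 / living_languages            # ~0.078
reduction = 1 - total_speakers / pre_conquest        # ~0.90

print(f"{share_of_speakers:.1%} of speakers use {share_of_languages:.1%} "
      f"of the languages; {reduction:.1%} reduction since contact")
```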

I imagine that if we look at any area on the globe where conquered indigenous languages jostle beside more powerful indigenous or colonial languages, we’ll find similar data showing that, even amongst the less powerful languages, there remains a very skewed hierarchy of linguistic health. One can’t help wondering what’s at work here . . .

I enjoy compiling large sets of data like this because certain questions just don’t come into sharp focus until we compile the data. I think most rhet/comp scholars, like Cushman, have a general understanding that certain native American languages are in better shape than others; however, until we take the time to work with the actual data set (all living and extinct native American languages), we won’t discover this skewed pattern within it, and we won’t be able to formulate what, to my mind, are highly interesting and relevant questions: why and how have certain languages managed to survive and (comparatively) thrive while most other native languages have gone extinct or dwindled to only a few hundred speakers? What did these languages and tribal groups have going for them that the others didn’t? Was it a purely linguistic advantage, a purely geopolitical advantage, or a combination of both?

In part, we can read Cushman’s book as an answer to these unformulated questions. While Cushman spends a lot of time (rightly) describing language attrition among contemporary Cherokees, she perhaps doesn’t realize that Cherokee is doing a hell of a lot better than most other native languages. Although her book presents something of a contrast between the language’s current weakened state and the syllabary’s historic role in uniting and strengthening the Cherokee against further Western encroachment, we can see, in light of this data, how the contrast is perhaps instead a partial explanation for the fact that Cherokee isn’t as unhealthy as the vast majority of native American languages. In other words, the existence of the Cherokee syllabary may very well be one of the reasons why Cherokee exists on the healthier side of living native languages, why Cherokee isn’t entirely extinct.

Stylizing Sequoyah’s thought process, Cushman writes, “If whites could have a writing system that so benefited them, filling them with self-respect and earning the respect of others, then Cherokees could have a writing system with all this power as well” (35). After compiling statistics on native language health, I can see that Cushman, in focusing on current language attrition among the Cherokee, misses a deeper exploration of a compelling possibility: that the syllabary’s power not only bolstered the Cherokee people but also perhaps played a part in saving the Cherokee language itself from total extinction. The syllabary’s strengthening role was not merely a historic phenomenon; without it, perhaps there wouldn’t be a Cherokee language today at all.

This is a good example of why I think digital tools and databases have a lot to offer the humanities: without them, patterns go unnoticed and questions go unasked.

Methodological notes: I couldn’t rank linguistic health among native languages without first deciding what “counted” as a native language and what was simply a dialect of a language. This language/dialect issue is sometimes difficult to navigate, and Ethnologue typically gives each dialect its own language code. But such granularity is misleading; Madrid Spanish and Buenos Aires Spanish are different in many respects, but speakers in both places can understand one another because they are still, despite the differences, speaking Spanish.

Mutual intelligibility between speaker populations is the general rule for differentiating a dialect from a separate language, and I’ve done my best to follow that rule. For example, I’ve counted Ojibwe as a single language, even though Ojibwe is in fact a continuum of dialects; on the other hand, I’ve divided the Miwok continuum into different languages (Sierra Miwok, Plains Miwok, et cetera). Speakers of the Miwok languages, while closely related, have difficulty understanding one another in a way that speakers of Ojibwe dialects do not. So, Ojibwe is a single language, while the Miwok ‘dialects’ should really be considered separate languages.

However, none of this made a huge difference in the rankings. Some might quibble with my grouping of all Ojibwe or Cree dialects into a single language, but even had I taken out the dialects that aren’t perfectly intelligible with the others, each of these languages still would have retained tens of thousands of speakers. Conversely, even had I counted all Miwok speakers as a single linguistic group, Miwok would still have fewer than 50 speakers.

Finally, when compiling statistics on numbers of speakers for each language, I used field linguists’ counts when they were available, rather than census counts, which tend to err on the side of liberality. (E.g., according to the U.S. census, there are over 150,000 Navajo speakers, but most linguists consider this an unlikely number.)

The trouble with language phylogenies

A new paper by Bouckaert et al. claims to show support for the hypothesis that Anatolia was the birthplace of the Indo-European languages, going against the more popular hypothesis that the ancient inhabitants of the Steppes were the linguistic progenitors of the world’s most successful non-Sino languages. (Here’s the accompanying free website detailing their work.)

The methodology of the paper is new to me: Bayesian phylogeography. What the team did was to compile a large dataset of cognates (e.g., English water and German Wasser), which presumably show a common ancestry. They then plotted known contemporary and dead languages onto a map, based on known information about where the languages are or were spoken. They modeled the evolution of language change as “the gain and loss of cognates along the branches of an unknown family tree, using an approach called Bayesian phylogenetic inference to infer the set of language trees that makes the [known] cognate data most likely” (emphasis added). Using this data and model, the team ran a series of “Brownian walks” that, working backwards, showed how far and where a language could have spread given its known ancestry (i.e., given the constraints of the known geographies and phylogenies).

Essentially, they combined what is known about languages’ geographies and phylogenies and, using a lot of equations I don’t understand, worked backwards to model probable locations of where all these Indo-European languages actually came from. From the paper:

The Bayesian approach we employ means that we can directly test support for the Steppe homeland hypothesis versus the Anatolian homeland hypothesis. This is because the method we use does not produce a single answer – e.g. the homeland is at x degrees longitude and y degrees latitude. That would not be all that useful, because if you want to test between competing theories, you need some estimate of uncertainty – how sure are you that the origin is at one location versus another?

There is uncertainty in the relationships between the languages (nobody can say with absolute certainty that one particular family tree is the true one – for 103 languages there are more possible trees than there are atoms in the universe!), there is uncertainty in the time scale (we can’t know for sure exactly how fast languages change), and even if we knew the family tree and time scale exactly, there is uncertainty in the geographic expansion process so we cannot pin down the location of the root exactly.

One of the major advantages of the Bayesian approach is that we do not produce a single answer, but instead account for all those uncertainties using some clever algorithms (called Markov chain Monte Carlo methods) that sample language trees, divergence times and locations at all points on the tree, in proportion to how likely they make our observed data . . . So we were able to run our analyses and directly compare how often the origin locations we inferred fell in the range proposed for the Steppe theory versus in the range proposed for the Anatolian theory . . . As we report in the paper, using either version of the Steppe theory, it was the Anatolian theory that came out on top.

The map at the bottom of this page shows that, given a wide range of uncertainties about where languages travel, how fast they evolve, and which is related to which, the range of possible geographic birthplaces of Indo-European languages can be limited most probably to the Anatolian region.
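To make that “how often does the inferred origin fall in each proposed range” step concrete, here’s a toy sketch. This is not the phylogeographic inference itself: a random cloud of points stands in for the posterior samples an MCMC run would produce, and the crude bounding boxes (with invented coordinates) stand in for the two candidate homelands.

```python
import random

# Toy illustration of the model-comparison step only. The 'posterior'
# is faked with a Gaussian cloud; the regions are invented bounding boxes.
ANATOLIA = {"lat": (36, 42), "lon": (26, 45)}
STEPPE = {"lat": (45, 55), "lon": (30, 60)}

def in_region(sample, region):
    """True if a (lat, lon) sample falls inside the region's bounding box."""
    lat, lon = sample
    return (region["lat"][0] <= lat <= region["lat"][1]
            and region["lon"][0] <= lon <= region["lon"][1])

random.seed(0)
samples = [(random.gauss(39, 3), random.gauss(35, 6)) for _ in range(10_000)]

# Posterior support = fraction of samples landing in each proposed homeland.
support = {name: sum(in_region(s, region) for s in samples) / len(samples)
           for name, region in [("Anatolia", ANATOLIA), ("Steppe", STEPPE)]}
print(support)
```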

Like I said, I still don’t fully understand how to do Bayesian methods, but I appreciate them on an epistemological level. Bayesian methods, such as the ones used here, don’t model things based on an “all things being equal” approach. Insofar as I understand them, Bayesian equations take into account known and assumed information and produce probabilistic answers, answers that change along with the known information.

This paper works with two “knowns” about languages: their geographies and their phylogenies. I have no problem with the way the authors of the paper handled the former. I do think there is a problem with how they handled the latter.

A linguistic phylogeny (family tree) is often modeled like a human family tree, with a parent giving rise to children and related parents giving rise to cousins of the children, and so forth. As the authors put it in their paper:

Languages evolve through time in a manner similar to biological species. As groups of speakers become separated, their speech drifts apart forming new descendant languages, and eventually whole families of related languages. Over thousands of years this process has generated the 6000+ languages in the world today . . . We can represent the relationships between languages on a family tree, otherwise known as a ‘phylogeny’. A simple example of a phylogeny is a family tree where the leaves of the tree represent the children in a family and branches represent relationships between parent and child.

The problem with this model is that it doesn’t take into account what the geneticists might call “admixture” between separate populations. It is true that, when a language is in isolation, it evolves through time on its own; it is also true that, when speakers of a language drift apart geographically, the languages of these split populations will evolve through time in two new ways. However, the authors of the paper fail to take into consideration another known fact about language evolution: when languages come into contact, one or both of the languages will evolve in a new way due to the linguistic contact.

The problem is, if we try to put language contact into our phylogenies, we no longer have linear branching phylogenies. Rather, we are forced to view language evolution as more of a network than a linear family.

We needn’t look further than Standard English to see how this works. (This graphic oversimplifies the relations, but it gets the point across . . .)


We know the recent history of English, and it sure as hell doesn’t include a straight evolution-in-time from Old English to contemporary English, from Beowulf to The Canterbury Tales, from Caedmon to Malory. Old English texts, circa 1000 AD, are incomprehensible to the modern English speaker; Middle English texts, circa 1400 AD, are quite easy to read. What happened during those 400 years? Punctuated linguistic equilibrium? Geographic separation?

No. The Norman invasion happened. The most recent ancestor of the French language (which, I believe, had a lot in common with modern Spanish) traveled across the Channel from its Latin family tree and interrupted the evolution of the Germanic tree, and the interruption left us with, among other things, the English language we know and love today. English is not pure Germanic, with a pure and direct lineage from proto-Germanic to Old English to English. Neither does it boast a purely Latinate ancestry. English is a mix of two otherwise separate family trees, the Germanic and the Latin.
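One way to make this concrete is to represent ancestry as a mapping from each language to its parents: a tree permits exactly one parent per language, while a contact network permits several. A minimal sketch, with deliberately simplified lineages:

```python
# A tree forces exactly one parent per language; modeling contact means
# allowing several. The lineages below are drastically simplified.
parents = {
    "Modern English": ["Middle English"],
    "Middle English": ["Old English", "Norman French"],  # contact, not descent alone
    "Old English": ["Proto-Germanic"],
    "Norman French": ["Latin"],
}

def ancestors(lang, links):
    """Every ancestor reachable through the parent links."""
    seen, stack = set(), [lang]
    while stack:
        for parent in links.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(ancestors("Modern English", parents))
# Both Proto-Germanic and Latin show up, which no strict tree can represent.
```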

Percentages come into play here; so, too, does the untangling of the two ancestral strands. Is English half German, half Latin? No, probably not. But this is indeed a fascinating question: figuring out how these internetworked phylogenies surface in the phonology, morphology, and syntax of contemporary Standard English.

(Having learned German and now learning Spanish, I am prepared to offer a broad generalization: English shares more with Spanish and French in terms of syntax and idioms; it shares more with German in terms of phonology; it also shares German’s simple, consonant-based verbal inflection system. At this point, I’d say that English is a Latin idiom without a Latin verbal system, spoken with a German accent.)

There are interesting critiques and comments on the study’s methods and outcome here and here. And anthropology blogger Dienekes pointed me to a blogpost he wrote on network models of linguistic relations, which, obviously, I think are the right kinds of models for language evolution and the relations between living languages.

Fun with idioms and expletives

In general, you can’t re-arrange idiomatic expressions too much without losing the idiom. Thus,

1. The shit hit the fan

2. The fan was hit by shit

Reading (2), most of us probably envision literal shit literally hitting a literal fan before we re-compute the sentence as a passivized idiom. Still, it’s a legitimate and grammatical sentence. I can imagine someone using it for extra comedic effect.

However, consider the following idiomatic expression and its passive equivalent:

3. It rained cats and dogs (or, It was raining cats and dogs)

4. #Cats and dogs were raining

5. *Cats and dogs were raining by it

Unlike (2), sentences (4) and (5) don’t survive passivization: (5) is flatly ungrammatical, and (4), though syntactically well-formed, is semantically unacceptable because its theta grid is all mucked up.

What gives? Why can’t we passivize the idiomatic expression “It rained cats and dogs” the way we passivized “The shit hit the fan”? I haven’t seen this problem addressed in linguistics journals, but I’d bet a bottle of nice Scotch that it has to do with the expletive.

In a generative framework, passive sentences are structurally equivalent to their active counterparts. “The boy hit the ball” becomes “The ball was hit by the boy” via a simple transformation triggered by the addition of auxiliary “was.” Passivizing “hit” with the auxiliary blocks the verb’s ability to assign case to the DP “the ball.” So, to get case, the DP moves to spec-TP, and “the boy,” which was the subject in the active sentence, moves to an optional adjunct phrase (typically a by-phrase).


These movements cannot occur with the idiomatic expression in (3) because the active-voiced idiom contains an expletive in spec-TP. Passivizing the verb into “were raining” would mean moving the expletive to an optional adjunct phrase, as in (5). But expletives can’t appear in prepositional phrases! They aren’t allowed to surface in X-governed nodes. That explains why (5) is ungrammatical and why (4) sounds meaningless.
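The constraint can be stated as a toy rule: passivization demotes the subject into a by-phrase, and an expletive subject can’t surface there. The sketch below illustrates just that one rule; the function, its signature, and its (agreement-ignoring) output are invented for this post, not a real parser.

```python
# Toy sketch of the constraint: passivization demotes the subject into a
# by-phrase, and an expletive subject cannot surface in that adjunct.
# Agreement is ignored; this is an illustration, not a grammar.
EXPLETIVES = {"it", "there"}

def passivize(subject, participle, obj):
    """Return the passive counterpart of a simple active clause, or None
    when the demoted subject would be an expletive stranded in the by-phrase."""
    if subject.lower() in EXPLETIVES:
        return None  # *'cats and dogs were rained by it'
    return f"{obj} was {participle} by {subject}"

print(passivize("the shit", "hit", "the fan"))
print(passivize("it", "rained", "cats and dogs"))  # blocked: None
```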

Now, while I was working on this fun linguistics puzzle, someone brought to my attention that it’s possible to passivize the following sentence, which is not idiomatic but does contain an expletive:

6. We found it to be raining.

7. It was found to be raining.

However, the expletive in this passive sentence is generated in relation to “found”, an Exceptional Case Marking verb, and an infinitival to-be clause. So, the active version of this sentence can be analyzed simply as a small-clause structure:

[We found [it to be raining]]

The passive sentence, (7), generates by moving the expletive from spec-TP in (6) to spec-TP in (7) so that it can receive case. In other words, expletive it never needs to surface in an X-governed node like it would in (5). So, the sentence can be passivized.


Halliday v. Chomsky

Entertaining multiple hypotheses is difficult. We tend to make up our minds about things long before we’ve sifted through the evidence and the arguments about a specific issue. For example, I have long been skeptical about Chomskyan linguistics and have always held a deep regard for any linguistic theory that was not Chomskyan. However, my skepticism toward one and preference for the other did not result from long hours of objective study. My disposition resulted from a general distrust of positivistic approaches to language, my general skepticism about the existence of Universal Grammar, and, on the other end, my general fondness for linguists who took meaning and social function into their accounts of grammar.

The linguist who inspired me the most was Michael Halliday. His systemic functional grammar was the first linguistic theory I tried to understand. The interesting thing is, however, that Halliday never comes out and positions systemic functional grammar against generative grammar, the minimalist program, or any other Chomskyan theories. I can’t remember what essay precisely, but somewhere Halliday makes the point that his is simply a different approach to the study of language. It neither denies nor affirms the generative approach, with its goal of modeling UG. Systemic functional grammar simply presupposes different things about language and systematizes different aspects of it.

Now that I’m studying generative grammar, I am coming to realize that Halliday is quite right not to position his theory against Chomsky’s. It’s different, not necessarily opposed. It has different goals, assumptions, and frameworks. Probably the most important difference can be summed up this way: while generative syntax does not try to link its theories with phonology on one hand or semantics and pragmatics on the other, systemic functional grammar is an attempt to make sense of all dimensions of language as they work and evolve together, from phonetic production to pragmatic context.

Another difference between Halliday’s and Chomsky’s linguistic theories, however, is the framing of language as structure versus system. Here’s how Halliday puts it in his Introduction to Functional Grammar:

Structure is the syntagmatic ordering in language: patterns, or regularities, in what goes together with what. System, by contrast, is ordering on the other axis: patterns in what could go instead of what . . . Any set of alternatives, together with its condition of entry, constitutes a system in the technical sense.

Halliday gives the example of positive/negative polarity in a clause. A clause is either positive (“Ich liebe dich”) or negative (“Ich liebe dich nicht”); statistically speaking, 90% of clauses will be positive, 10% negative; this system of entry points, choices, and probabilities forms the basis of systemic functional grammar:


Halliday continues his differentiation between systemic and generative grammars:

A text is the product of ongoing selection in a very large network of systems – a system network. Systemic theory gets its name from the fact that the grammar of a language is represented in the form of system networks, not as an inventory of structures. Of course, structure is an essential part of the description; but it is interpreted as the outward form taken by systemic choices, not as the defining characteristic of language. A language is a resource for making meaning, and meaning resides in systemic patterns of choice.

In other words, while generative grammar seeks to model the computational steps our brains make in order to generate every possible sentence in a language, systemic grammar seeks to describe the myriad systems of choice (from phonemic to semantic) that work together at different but connected levels to make meaning with a given utterance.

Thus, generative grammar takes into consideration things like mood and tense while developing its theories, but the endgame is to model the mental structure of a grammatical utterance, including the moves and transformations it undergoes during computational brain processes:


Systemic functional grammar, on the other hand, models the nearly instantaneous choices that unfold in time when someone utters something. With each step into the system, each choice further limits future choices until a speaker is left with a possible utterance, like “Is there any corn left?”
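That step-by-step narrowing can be sketched as a nested mapping, where each selection is the entry condition for the next system. The network below is invented and drastically simplified; a real system network covers far more dimensions than mood:

```python
# A toy system network: each choice opens the next system, narrowing what
# can be uttered. Labels, options, and utterances are illustrative only.
network = {
    "indicative": {
        "declarative": "There is some corn left.",
        "interrogative": "Is there any corn left?",
    },
    "imperative": "Leave some corn!",
}

def traverse(node, choices):
    """Follow a path of selections through the network to an utterance."""
    for choice in choices:
        node = node[choice]
    return node

print(traverse(network, ["indicative", "interrogative"]))
```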


It’s not quite true to say that the approaches are incompatible, which would imply necessary contradictions between their theoretical frameworks. It’s better to say that we’re dealing with the apples and oranges of linguistic theory. Of course, one must eventually choose to work in one domain or the other, but, at the moment, I find it quite possible to entertain both theories without deciding which one is “right” or “better.” It’s a good exercise in learning to see the beneficial and the problematic, the good and the bad, of two divergent theories.

As I said at the beginning, entertaining two hypotheses is difficult . . . but it’s worthwhile to try.

Meanings of ‘writing’ and ‘rhetoric’ in RSQ and CCC

Earlier this year, I compiled two small corpora of article abstracts from the most prominent journals in the American fields of rhetoric and writing studies: Rhetoric Society Quarterly and College Composition and Communication, respectively. The RSQ abstracts stretch from Winter 2000 (30.1) to Fall 2011 (41.5), for a total of 220 abstracts. The CCC abstracts stretch from February 2000 (51.3) to September 2011 (63.1), for a total of 261 abstracts. I think that article abstracts are a good vantage point for looking at disciplinary trends, because (in the humanities, anyway) researchers tend to write abstracts that function like movie previews. Designed to appeal to a specific disciplinary audience, abstracts signal that their articles ‘belong’ in the field by using all the right buzz words, name-dropping all the right researchers, and making all the right stylistic moves that make other researchers want to read the article.

Using Python and the Natural Language Toolkit to explore these two corpora of abstracts, I’ve discovered both interesting and unsurprising things about how rhetoric and writing studies have taken shape, over the last decade, as separate but ambivalently related disciplines. One of the more interesting pieces of capta demonstrated by the corpora is that the words ‘writing’ and ‘rhetoric’ share grammatical contexts with very different lexical items, suggesting that each word means something different in each journal.

Before I get to the details, though, here’s a bit about my methodology:

With Python and NLTK, you can chart how a word is used similarly or differently in two corpora. For instance, a concordance of the word ‘monstrous’ in Moby Dick reveals contexts such as ‘the monstrous size’ and ‘the monstrous pictures’. Running a few extra commands, you discover that words such as ‘impalpable’, ‘imperial’, and ‘lamentable’ are also used in these same contexts. Running an identical search on Sense and Sensibility, however, reveals that ‘monstrous’ shares contexts with quite different terms: ‘very’, ‘exceedingly’, and ‘remarkably’. Dissimilar contexts reveal different connotations for ‘monstrous’ in each novel, positive or neutral in Austen but negative in Melville. This, basically, was the method I applied for mining the usage of ‘rhetoric’ and ‘writing’ in the abstracts corpora (more details below the tables, though).
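NLTK exposes this through methods like Text.similar() and Text.common_contexts(). The underlying idea can also be sketched in a few lines of plain Python, run here on a tiny invented corpus rather than the real novels:

```python
from collections import defaultdict

# Pure-Python sketch of the shared-contexts idea behind NLTK's
# Text.similar() and Text.common_contexts(), on a tiny invented corpus.
def contexts(tokens):
    """Map each word to the set of (previous, next) pairs it occurs between."""
    ctx = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        ctx[word].add((prev, nxt))
    return ctx

tokens = ("the monstrous size and the monstrous pictures and "
          "the lamentable size and the lamentable pictures").split()
ctx = contexts(tokens)
shared = ctx["monstrous"] & ctx["lamentable"]
print(shared)  # the contexts in which the two words are interchangeable
```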

‘Rhetoric’ occurred 244 times in RSQ abstracts and 69 times in CCC abstracts. ‘Writing’ occurred 22 times in RSQ and 251 times in CCC. I compiled common grammatical contexts for each term in each corpus. Each context took the form,


where N was any term and x was ‘rhetoric’ or ‘writing’, respectively.


The more contexts two terms share, the more likely it is that, within the specific corpus, the two terms are used interchangeably. One way to get your head around this fact is by looking at grammatical contexts without an operative term:

(1) I _ you

In an English corpus, the words that appear in that _ context will be semantically limited. Hundreds, if not thousands, of words will indeed fit in that context, but all of them will nonetheless share some discerning semantic value: for example, every word that can appear in (1) must be a transitive or di-transitive verb, and none can be a 3rd person present form. Right off the bat, this context has narrowed its possible terms down to a fraction of all the words in the English lexicogrammar. Throw in a second context, and the list of terms grows even smaller:

(2) is _ by

Given the rules of English morphology and semantics, most of the words that appear in this context will be past participles of action or emotive verbs (e.g., loved, felt, killed, written, eaten, trapped). Terms that can appear in both (1) and (2) are quite limited: only transitive or di-transitive verbs, no 3rd person present forms, and now no irregular verbs, whose simple past and past participle differ (wrote/written, ate/eaten) and so cannot fill both slots.

If we start using contexts that contain more than just semantically null stopwords on both sides, it’s easy to see how the list of terms can grow very short very quickly:

(3) I _ girls

What kind of words can appear in (1), (2), and (3)? No irregular verbs, no 3rd person present verbs, and now, probably no di-transitive verbs, given the lack of a definite article before ‘girls’ (e.g., I put the girls to bed). Words that can appear in all three of these contexts would likely be words that are easily grouped together in some meaningful way.
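The intersection logic behind (1), (2), and (3) is easy to simulate. A toy sketch (the miniature corpus and the `fillers` helper below are invented for illustration):

```python
# Toy corpus, lowercased and tokenized; invented purely for illustration.
corpus = ("i loved you . she is loved by him . "
          "i saw you . it is eaten by birds").split()

def fillers(tokens, left, right):
    """Return the set of words appearing between a given left and right word."""
    return {tokens[i] for i in range(1, len(tokens) - 1)
            if tokens[i - 1] == left and tokens[i + 1] == right}

ctx1 = fillers(corpus, "i", "you")   # words that fit 'I _ you'
ctx2 = fillers(corpus, "is", "by")   # words that fit 'is _ by'

# Only regular verbs survive both contexts; irregular 'eaten' falls out.
print(ctx1 & ctx2)                   # {'loved'}
```

Even in this tiny corpus, the irregular participle ‘eaten’ fits (2) but not (1), so the intersection keeps only the regular verb ‘loved’, mirroring the argument above.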

So, when a corpus analysis tells us that two words share half a dozen or more contexts in a specific corpus, you can see how these words might share not only grammatical but semantic and definitional attributes within the corpus. The simple example of ‘monstrous’, ‘lamentable’, and ‘imperial’ in Melville demonstrates this distributional fact. It is also borne out by the large number of contexts (20!) shared by ‘writing’ and ‘composition’ in the CCC abstracts, two words that I knew, a priori, were synonymous in the American field of writing studies. The analysis confirms this a priori knowledge, lending confidence to the methodology.


While the terms sharing 2 or 3 contexts in the tables above are interesting, our attention should be focused on the terms near the top of the lists. In RSQ, ‘language’, ‘discourse’, ‘art’, ‘persuasion’, ‘theory’, and ‘texts’ tell us indirectly what the word ‘rhetoric’ means in that journal; in CCC, ‘writing’, ‘composition’, ‘education’, ‘place’, and ‘theory’ provide the same information.

The highlighted terms are the terms that overlap between each journal’s set of common contexts. The overlap is minimal. For ‘rhetoric’, only a single word (‘theory’) overlaps and surfaces in more than 3 distinct contexts in each journal; for ‘writing’, no word overlaps and surfaces in more than 3 distinct contexts. More telling is that ‘writing’ and ‘rhetoric’ themselves possess a high degree of interchangeability in CCC, sharing 7 distinct contexts, but a very low degree in RSQ, sharing only 2 distinct contexts. In other words, these capta suggest that ‘writing’ and ‘rhetoric’ mean nearly the same thing in CCC but do not mean the same thing at all in RSQ.

Loading a corpus into the Natural Language Toolkit

[UPDATED: See this post for a more thorough version of the one below.]

Looking through the forum at the Natural Language Toolkit website, I’ve noticed a lot of people asking how to load their own corpus into NLTK using Python, and how to do things with that corpus. Unfortunately, the answers to those questions aren’t exactly easy to find on the forums; they’re scattered around in different threads, and often a bit vague. Part of this is probably because the REAL coders on the forums want us noobs to figure it out ourselves. And I get that. The more I play with Python, the more I realize that the best way to learn code is to read about the basics, then start dicking around to see what works and what just gives you a scroll of red. You won’t learn anything by constantly asking “PLEASE TO HAVE TEH CODE?” on web forums.

That being said, I think the ability to load a corpus into the program is a pretty basic step. I wonder how many people have abandoned exploring the program because they couldn’t figure out how to load something other than the pre-loaded corpora. So, I’ll start throwing up some EASY lines of code for some BASIC functions, in the hope that noobs like me googling around for answers might run across them.

For now, I’ll provide the basic steps for loading your own non-tagged corpus into the program:

1. Save your corpus in a plain text format (e.g., a .txt file) using Notepad or some other text editor. Depending on what you want to do with your corpus, it might be easier to use Word first to deal with punctuation, capitalization, etc. You can smooth out the text using NLTK, but it’s easy to do it in Word before saving the document in a plain text format. (Getting rid of punctuation and capitalization is important when compiling lexical statistics; most corpus analysis tools will, for example, count rhetoric and Rhetoric as two separate lexical entries because one is capitalized and the other isn’t.)

2. Save the .txt file in the Python folder.

3. Load up IDLE, the Python GUI text-editor.

4. Import the NLTK book:
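Per the first chapter of the NLTK book, this is presumably just:

```python
import nltk
```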


5. Import the Texts, like it says to do in the first chapter of the NLTK book. There are certain tools that won’t work unless these are imported.


6. Now you’re ready to load your own corpus, using the following code:
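A reconstruction based on the description below. (The first line, which creates a tiny sample ccc.txt, is only there so the snippet runs by itself; in practice you already saved the file in step 2.)

```python
import nltk

# (Demo only: create a tiny ccc.txt; in practice you saved it in step 2.)
open('ccc.txt', 'w').write('rhetoric and writing in the abstracts corpus')

raw = open('ccc.txt').read()     # read the whole file as one string
tokens = raw.split()             # split it into a list of words
Abstracts = nltk.Text(tokens)    # wrap the list for NLTK's tools
```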


Basically, these lines split all the words in your file into a list that NLTK can access and read, so that you can run analyses on your corpus using the NLTK tools. Above, ‘ccc.txt’ is the plain text file that I saved in my Python folder. ‘Abstracts’ is just what I label the resulting text while working with NLTK. You can name it whatever you want.

Now, just run a basic concordance command to make sure it works:
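Presumably a single call like this one (shown here on a small stand-in Text so the line runs on its own):

```python
import nltk

# Stand-in for the Abstracts Text built in step 6.
Abstracts = nltk.Text("a study of writing and the teaching of writing".split())

Abstracts.concordance('writing')   # prints every occurrence in context
```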


And you’re ready to go.

Rhetorical N-grams

For most of its 2,500-year history, the Greek word ‘rhetoric’ has had a stable definition: the study or practice of persuasive and effective communication through speech or writing. However, beginning with Kenneth Burke, Richard Weaver, and I.A. Richards in the early twentieth century, the definition of rhetoric began to take on nuances it had previously lacked. Rhetorical theory began to concern itself with symbols in general, not just language. It also began to look for “persuasion” where none was expected, in order to chart the countless small ways that people are persuaded to do what they do, believe what they believe, and live how they live. With this expanded notion of what rhetoric was, rhetorical theory began to take on the role of cultural criticism, and indeed, many academic rhetoricians today are allied more closely to cultural studies than to the rhetorical tradition. In a lot of work, the word “rhetoric” seems to mean a million different things, sometimes negative and sometimes positive. All we need to do to see how many different definitions “rhetoric” possesses in the academic world is look at the titles of academic texts. Here are just a few: The Rhetoric of Fiction; Reclaiming the Rural: Essays on Literacy, Rhetoric, and Pedagogy; Rhetoric and Resistance in the Corporate Academy; Nineteenth-Century Rhetoric in North America; Starring the Text: The Place of Rhetoric in Science Studies. It seems as though we have moved beyond the Aristotelian understanding of rhetoric as “a counterpart to dialectic” and “the ability to see the available means of persuasion.”

An interesting way to chart this change in rhetoric’s definition is to use Google’s N-Gram viewer. (Yes, I set smoothing to 0, for better or worse.)


By itself, the word “rhetoric” seems to become more popular after the 1960s, which is precisely when the “rhetorical renaissance” really began to take off in America. We see a change in rhetoric’s definition and a concomitant expansion in the use of the term. But even more telling than this unigram distribution are the other N-grams associated with academic rhetoric. For example, the genitive construction “the rhetoric of” is fairly limited prior to 1960 but noticeably more popular afterward:


Before 1960, I would expect, “the rhetoric of” typically refers to “the rhetoric of Aristotle” or “the rhetoric of Ramus”; that is, it typically surfaces as a subjective genitive referring to the rhetorical theory of a specific person. Today, however, this trigram not only continues to be used as a subjective genitive; it is also frequently deployed as an objective genitive, referring to the rhetorical action of some person, group, or practice. This might account in part for the increased use of the trigram. Also interesting, of late, is the increased use of a plural form of nominal and genitive rhetoric:



“The rhetorics of” is completely unattested prior to the twentieth century; its use undergoes an exponential increase! This says a lot about the expanded definition of the word in 20th century rhetorical theory. Another trigram that appears overnight in the 1960s is “A rhetoric of”, which exhibits a genitive construction coupled with an indefinite article.


Of course, comparing the unigram “rhetoric” to “history” and “philosophy” makes it clear that the expanded interest in rhetoric was still a very limited phenomenon: