Hindi 101

I’m taking Hindi 101 this semester. The Devanagari script feels mildly ornate in my hand compared to the angularity of alphabets descended from the Phoenician script (including the English alphabet), but it is quite lovely and not as challenging as I had imagined. It is still an alphabet, after all, with a much closer sound-grapheme correspondence than one finds in English, where each letter, particularly the vowels, can correspond to multiple phonemes. (English grammar is absurdly simple compared to all other major languages, but our spelling system must be a nightmare for foreign learners. There’s something to be said for language academies that control the drift between pronunciation and spelling.) Devanagari does, however, omit some vowel sounds and use secondary or “dependent” vowel forms in most contexts, so it has something of the syllabary about it. In fact, the biggest mistake I make in class is to confuse two dependent vowels,  ी and  ो. The former is long “ee”, the latter is “o”, but in certain fonts (including my own handwriting), they look nearly identical.

The script’s biconsonantal conjuncts are mostly intuitive, though a few bizarre ones need to be memorized as separate graphemes. We have conjuncts in English, but I believe they are a relatively new innovation with limited usage. One example is the city logo of Huntington Beach, California. Hindi has a lot of these, and they are quite common.


An English biconsonantal conjunct.

Apart from learning a new script, the most enjoyable part of Hindi class has been coming across Romance or Germanic cognates. At an intellectual level, I know and have long known that Hindi and English, both Indo-European languages, share a genetic ancestry, which means that at some point in the distant past all Indo-European speakers spoke the same language. It’s easy to get a handle on the concept when talking about Romance languages: Spanish, Italian, and French all used to be Latin. There, we have a well-documented history, stretching back through the Renaissance and Middle Ages to the familiar world of Rome. However, when it comes to Proto-Indo-European, we are faced with a deeper and wider canyon of time and an ancient world that is mostly unknown to us. The PIE speakers were probably living in the Pontic-Caspian steppe lands, but some evidence suggests that they may have been living in the greater Anatolian region; perhaps the most direct descendants of Proto-Indo-Europeans are today’s Armenians, Turks, and Persians. They apparently kicked ass and took names because Indo-European now stretches from the Pacific to the Indian Oceans.

But whoever they were, the PIE speakers are remote in a way that the Romans or Germanic tribes are not. Yet while doing my Hindi homework, every now and again I come across a word that clearly indicates the ancient linguistic (and genetic) connectedness between the Romans, the Germans, and the Hindi speakers. Kamiz for shirt; mez for table; kamra for room; mata for mother; pita for father; nam for name; darvaza for door . . . In Hindi class, when I say a word out loud that is clearly related to a European word, I am intoning sounds close to the ones that came from the lips of those ancient Indo-Europeans before they split eastward and westward to conquer Eurasia. To language nerds like me, it’s a chilling sensation.

Elliot Rodger’s Manifesto: Text Networks and Corpus Features

Analyzing manifestos is becoming a theme at this blog. Click here for Chris Dorner’s manifesto and here for the Unabomber manifesto.

Manifestos are interesting because they are the most deliberately written and deliberately personal of genres. It’s tenuous to make claims about a person’s psyche based on the linguistic features of his personal emails; it’s far less tenuous to make claims about a person’s psyche based on the linguistic features of his manifesto—especially one written right before he goes on a kill rampage. This one—“My Twisted World,” written by omega male Elliot Rodger—is 140 pages long, and is part manifesto, part autobiography.

I’ve made a lot of text networks over the years—of manifestos, of novels, of poems. Never before have I seen such a long text exhibit this kind of stark, binary division:


This network visualizes the nodes with the highest betweenness centrality. The lower, light blue cluster is Elliot’s domestic language; this is where you’ll find words like “friends”, “school,” “house,” et cetera . . . words describing his life in general. The higher, red cluster is Elliot’s sexually frustrated language; this is where you’ll find words like “girls,” “women,” “sex,” “experience,” “beautiful,” “never”  . . . words describing his relationships with (or lack thereof) the feminine half of our species.

It’s quite startling. Although this text is part manifesto and part autobiography, I wasn’t expecting such a clear division: the language Elliot uses to describe his sexually frustrated life is almost wholly severed from the language he uses to describe his life apart from the sex and the “girls” (Elliot uses “girls” far more frequently than he uses “women”—see below). It’s as though Elliot had completely compartmentalized his sexual frustration, and was keeping it at bay. Or trying to. I don’t know how this plays out in individual sections of the manifesto. Nor do I know what it says about Elliot’s mental health more generally. I’ve always believed that compartmentalizing frustrations is, contra popular advice, a rather healthy thing to do. I expected a very, very tortuous and conflicted network to emerge here, indicating that each aspect of Elliot’s life was dripping with sexual angst and misogyny. Not so, it turns out.
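For readers curious about the measure behind the network, here is a minimal sketch of betweenness centrality on a word co-occurrence graph. It is not the AutoMap pipeline used for the figures: the window size, the function names, and the five-word toy sequence are all assumptions made for illustration, and the centrality routine is a plain implementation of Brandes' algorithm for unweighted graphs.

```python
from collections import defaultdict, deque

def cooccurrence_graph(tokens, window=1):
    """Link each word to the words that follow it within `window` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph

def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes) for an undirected graph."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        preds = {v: [] for v in graph}
        paths = dict.fromkeys(graph, 0)
        dist = dict.fromkeys(graph, -1)
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}

# Hypothetical toy sequence, not taken from the manifesto.
toy_tokens = ["friends", "school", "my", "girls", "sex"]
scores = betweenness(cooccurrence_graph(toy_tokens))
bridge = max(scores, key=scores.get)
```

In the toy sequence the word "my" sits on every shortest path between the two halves, which is the sense in which a high-betweenness node bridges clusters.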

Here’s a brief “zoom” on each section:



In the large, zoomed-out network (the first one in the post), notice that the most central nodes are “me” and “my.” I processed the text using AutoMap but decided to retain the pronouns, curious how the feminine, masculine, and personal pronouns would play out in the networks and the dispersion plots. Feminine, masculine, personal: not just pronouns in this particular text. And what emerges when the pronouns are retained is an obvious image of the Personal. Rodger’s manifesto is brimming with self-reference:


Take that with a grain of salt, of course. In making claims about any text with these methods, one should compare features with the features of general text corpora and with texts of a similar type. The Brown Corpus provides some perspective: “It” is the most frequent pronoun in that corpus; “I” is second; “me” is far down the list, past the third-person pronouns.

Here’s another narcissistic twist, found in the most frequent words in the text. Again, pronouns have been retained.


“I” is the most frequent word in the entire text, coming before even the basic functional workhorses of the English language. The Brown Corpus once more provides perspective: “I” is the 11th most frequent word in that general corpus. Of course, as noted, there is an autobiographical ethos to this manifesto, so it would be worth checking whether or not other autobiographies bump “I” to the number one spot. Perhaps. But I would be surprised if “I,” “me,” and “my” all clustered in the top 10 in a typical autobiography. It is a narcissistic genre by design, yet I imagine that self-aware authors attempt to balance the “I” with a pro-social dose of “thou.” Maybe I’m wrong. It would be worth checking.
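A check of that kind is easy to sketch. The snippet below ranks words by frequency and measures the share of first-person pronouns, so the same two numbers could be computed for the manifesto, a handful of autobiographies, and a general corpus. The tokenizer, the pronoun list, the function names, and the sample sentence are all assumptions for illustration, not the settings used for the charts above.

```python
import re
from collections import Counter

# Assumed pronoun set; extend or trim as the comparison requires.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

def tokenize(text):
    """Lowercase the text and keep runs of letters and straight apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def word_ranks(text):
    """Map each word to its 1-based frequency rank (1 = most frequent)."""
    counts = Counter(tokenize(text))
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), 1)}

def first_person_share(text):
    """Fraction of all tokens that are first-person singular pronouns."""
    tokens = tokenize(text)
    return sum(tok in FIRST_PERSON for tok in tokens) / len(tokens)

# Hypothetical sample, not a quotation from any of the texts discussed.
sample = "I took my book and I left. It was my day, and I was happy."
ranks = word_ranks(sample)
share = first_person_share(sample)
```

Running both functions over several texts of the same genre would show whether a top-ranked “I” is unusual or simply what autobiography looks like.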

More lexical dispersion plots . . .

Much more negation is seen below than is typically found in texts. According to Michael Halliday, most text corpora will exhibit 10% negative polarity and 90% positive polarity. Elliot’s manifesto, however, bursts with negation. Also notice, below, the constant references to “mother” and “father”; his parents are central characters. But not “mom” and “dad.” I’m from Southern California, born and raised, with social experience across the races and classes, but I’ve never heard a single English-only speaker refer to parents as “mother” and “father” instead of “mom” and “dad.” Was Elliot bilingual? Finally, note that Elliot prefers “girl/s” to “woman/en.”
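For anyone who wants to reproduce plots like these, a lexical dispersion plot is just a record of where each target word falls in the token stream. The sketch below collects those offsets, renders them as a one-line text strip, and adds a crude negation measure. The negation measure counts sentences containing a negator, which is only a rough proxy for Halliday's clause-level polarity; the negator list, the function names, and the sample text are assumptions.

```python
import re

# Assumed negator list; n't contractions are caught separately below.
NEGATORS = {"not", "no", "never", "nothing", "nobody", "none", "neither", "nor"}

def tokenize(text):
    """Lowercase the text and keep runs of letters and straight apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def dispersion(tokens, targets):
    """Map each target word to the token offsets where it occurs."""
    hits = {target: [] for target in targets}
    for offset, tok in enumerate(tokens):
        if tok in hits:
            hits[tok].append(offset)
    return hits

def strip_chart(offsets, length, width=40):
    """Render offsets as a text strip, with '|' marking each occurrence."""
    cells = ["."] * width
    for off in offsets:
        cells[min(width - 1, off * width // length)] = "|"
    return "".join(cells)

def negated_share(text):
    """Fraction of sentences containing a negator or an n't contraction."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    negated = sum(
        1 for s in sentences
        if any(t in NEGATORS or t.endswith("n't") for t in tokenize(s))
    )
    return negated / len(sentences)

# Hypothetical sample, not a quotation from the manifesto.
sample = "My mother called. My father never did. Girls don't talk to me. Mother came back."
tokens = tokenize(sample)
hits = dispersion(tokens, ["mother", "father", "girls"])
mother_strip = strip_chart(hits["mother"], len(tokens), width=15)
```

Comparing `negated_share` across texts of the same genre would give the 10% baseline something concrete to be measured against, with the caveat that sentences and clauses are not the same unit.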





Until I discover that autobiographical texts always drip with personal pronouns, I would argue that Elliot’s manifesto is the product of an especially narcissistic personality. The boy couldn’t go two sentences without referencing himself in some way.

And what about the misogyny? He uses masculine pronouns as often as he uses feminine pronouns; he refers to his father as often as he refers to his mother—although, it is true, the references to mother become more frequent, relative to father, as Elliot pushes toward his misogynistic climax. Overall, however, the rhetorical energy in the text is not expended on females in particular. This is not an anti-woman screed from beginning to end. Also, recall, the preferred term is “girls,” not “women.” Elliot hated girls. Women—middle-aged, old, married, ensconced in careers, not apt to wear bikinis on the Santa Barbara beach—are hardly on Elliot’s radar. (This ageism also comes through in his YouTube videos.) Despite the “I hate all women” rhetorical flourishes at the very beginning and the very end of his manifesto, Elliot prefers to write about girls—young, blonde, unmarried, pre-career, in sororities, apt to wear bikinis on the Santa Barbara beach.

I noticed something similar in the Unabomber manifesto. Not about the girls. About the beginning and ending: what we remember most from that manifesto is its anti-PC bookends, even though the bulk of the manifesto devotes itself to very different subject matter. The quotes pulled from manifestos (including this one) and published by news outlets are a few subjective anecdotes, not the totality of the text.

Anyway. Pieces of writing that sally forth from such diseased individuals always call to mind what Kenneth Burke said about Mein Kampf:

[Hitler] was helpful enough to put his cards face up on the table, that we might examine his hands. Let us, then, for God’s sake, examine them.


A possible explanation for the emergence of quotative “like” in American English

So Monica was like, “What are you doing here, Chandler?” and Chandler was like, “Uhh nothing” and then Monica was like, “Why are you here with Phoebe?” and Chandler was like, “I don’t know,” and Monica was like, “Whatever!”

Quotative “be like” probably gets on your nerves. Unfortunately for you, it spread like wildfire in the latter half of the 20th century and today is used by native and non-native speakers alike as often as they use traditional say-type quotatives. What is its structure, when did it arise, and why did it spread so quickly? This post offers a possible explanation, based on evidence dragged up from the depths of the Google Books Corpus. To appreciate that evidence, however, we need to start with some discussion of this quotative’s formal properties.


One interesting property of quotative “be like” is its ambiguous semantics. In some contexts, it is a stative predicate that denotes internal speech, i.e., thoughts reflexive of an attitude. In other contexts, it is an eventive predicate denoting an actual speech act. Sometimes, the denotation is ambiguous, as in (1):

(1) Monica was like, “Oh my God!”

. . . Did Monica literally say “Oh my God!” or did she just think or feel it?

Another interesting property of quotative “be like” is that it disallows indirect speech.

(2a) Monica was like, “I should go to the mall.”

(2b) *Monica was like that she should go to the mall.

(2c) *Monica was like she should go to the mall.

Quotative say of course allows indirect speech:

(3a) Monica said, “I should go to the mall.”

(3b) Monica said that she should go to the mall.

(3c) Monica said she should go to the mall.

Haddican et al. (2012) recognize that quotative “be like” is immune to indirect speech due to its mimetic implicature. (2b) cannot be allowed because quotative “be like” always means something more along these lines:

(4) Monica was like: QUOTE

Given the implied mimesis of this construction, it makes no sense, as in (2b) and (2c), to add an overt complementizer and to change person/tense to produce an indirect, third person report. This property is shared by all uses of quotative “be like,” whether in their stative or eventive readings.

But there’s more to it than a mimetic implicature. Schourup (1982) points out that quotative “go” also shares this mimetic property (although he does not frame it as such). As expected of a quotative with a mimetic implicature, quotative “go” likewise does not allow an indirect speech interpretation via addition of an overt complementizer and shifts in person/tense:

(5a) Monica goes, “I should go to the mall.”

(5b) *Monica goes that she should go to the mall.

Why should these innovative quotatives be so immune to indirect speech and so committed to direct quote marking? Schourup suggests that quotative “go” (and, by extension, quotative “be like”) arose precisely to meet English’s need for a mimetic, unambiguous direct quotation marker. Prior to the occurrence of these new quotatives, English lacked such a marker. Consider (6a) and (6b) below:

(6a) When I talked to him yesterday, Chandler said that you should go to the doctor.

(6b) When I talked to him yesterday, Chandler said you should go to the doctor.

There is no ambiguity in (6a). The speaker of this utterance clearly intends to convey to his interlocutor that Chandler said the interlocutor should go to the doctor. (6b), however, introduces ambiguity. The utterance in (6b) can be interpreted in two ways: a) Chandler said the speaker of the utterance (i.e., I) should go to the doctor; b) Chandler said the speaker’s interlocutor (i.e., you) should go to the doctor. With orthographic conventions, of course, this ambiguity disappears:

(6c) When I talked to him yesterday, Chandler said, “You should go to the doctor.” (So I went.)

However, unlike other languages, spoken English has no “quoting” conventions—it has no direct quote markers for unmarked speech. It is unclear if (6b) is a true quotative or merely an indirect report on speech with a null complementizer.


We can imagine speakers needing to clarify this ambiguity:

JOEY: When I talked to him yesterday, Chandler said you should go to the doctor.

ROSS: Wait, he said I should go or you should go?

This ambiguity arises with say-type verbs whenever the complementizer that is omitted. It is traditionally understood that English differentiates between direct quotatives and indirectly reported speech via shifts in person and/or tense. However, the overt complementizer is really the central feature of this differentiation. Without an overt complementizer, it is never entirely clear if the embedded clause is a direct quote or an indirect report of speech. Here’s another example:

(7) JOEY: Chandler said I will be responsible for the cat’s funeral.

Without the aid of quote marks, we cannot know whether Chandler or Joey is responsible for the cat’s funeral, even though the embedded clause contains a shift in both person and tense. Of course, if Joey wants to convey that Joey himself will be responsible for the cat’s funeral, he can simply add the overt complementizer: “Chandler said that I will be responsible . . .” However, if Joey wants to convey that Chandler has decided to be responsible, Joey has no way to convey it unambiguously with say-type verbs. He must resort to an indirect speech construction with an overt complementizer. Alternatively, he can resort to non-structural signals: a short pause, a change in intonation, or a mimicry of Chandler’s voice. Or he must abandon say-type constructions altogether and convey his meaning some other way.

Quotative “go” and quotative “be like” solve this ambiguity. These innovative quotatives always signal that the following clause is mimetic, a direct quote of speech or thought. Many languages (Russian, Japanese, Georgian, and Ancient Greek, to name just a few) have overt markers to ensure that interior clauses are understood as being directly quoted material, whether or not those quoted clauses contain grammatical shifts (though of course they often do). The quotatives “go” and “be like” serve this same purpose. They are structural, unambiguous markers for direct speech, which is why one cannot use them for indirect speech, and which is also why they have spread so widely and quickly: they have met a real need in the language.

Quotative “go,” however, is attested long before quotative “be like.” The Oxford English Dictionary puts the earliest usage in the early 19th century, initially as a way to mime sounds people made, then later as a way to report on actual speech. Here’s an example from Dickens’ Pickwick Papers:


So, although I have said that both quotative “be like” and quotative “go” met a need in English for an unambiguous direct quotation marker, it was “go” that in fact met the need first, by at least a century. This historical fact leads me to suspect that quotative “be like” met a slightly different need: while quotative “go” became a direct quotation marker for speech acts, quotative “be like” became a direct quotation marker for thoughts. As Haddican et al. rightly note, an innovative feature of these quotatives is that they allow direct quotes to be descriptors of states. In other words, the directly marked quotes of “go” denote external speech; the directly marked quotes of “be like” primarily denote internal speech, i.e., thoughts or attitudes. I believe this hypothesis is supported by the earliest uses of quotative “be like,” to which we now turn:


Today, young native and non-native speakers of English frequently use “like” as a versatile discourse marker or interjection in addition to its use as a quotative (D’Arcy 2005). D’Arcy provides two extreme examples of discourse marker “like.” Both are taken from a large corpus of spoken English:

(8) I love Carrie. LIKE, Carrie’s LIKE a little LIKE out-of-it but LIKE she’s the funniest, LIKE she’s a space-cadet.      Anyways, so she’s LIKE taking shots, she’s LIKE talking away to me, and she’s LIKE, “What’s wrong with you?”

(9) Well you just cut out LIKE a girl figure and a boy figure and then you’d cut out LIKE a dress or a skirt or a coat, and LIKE you’d colour it.

This usage does not become noticeable in available corpora until the 1980s, so nearly all papers that I have read assume that discourse marker “like” and quotative “be like” arose more or less in tandem during the 1970s, becoming common by the 1980s. However, using the Google Books Corpus, I was able to find an early use of “like” that presages quotative “be like.” This early use also seems to set the stage for the versatile discursive uses of “like” seen in (8) and (9). This early use is the expression, “like wow.” It seems to have arisen during the 1950s (though perhaps earlier) in the early rock ’n’ roll scenes in the Southern United States. Here are some examples.

The first is from 1957: a line from a rock ’n’ roll song by Tommy Sands:

(10) When you walk down the street, my heart skips a beat—man, like wow!

The second is from a 1960 issue of Business Education World:

(11) Like, wow! I’m taking a real cool course called general business. It’s the most.


The third is from a novel called The Fugitive Pigeon, published in 1965:

(12) But all of a sudden you’re like wow, you know what I mean?

And by 1971, we have a full example of quotative “be like.” Note that this early occurrence uses an expletive as the subject:

(13) But to me it was like, “Oh, why can’t you say, ‘Gee that’s wonderful . . .’”


These early uses of “like wow” in (10) and (11) denote a stative feeling or attitude rather than any kind of eventive speech act. This is especially clear in (11), where the expression is a direct response to a question about how the speaker is feeling. The quotative in (13) likewise seems to be a stative predicate rather than an eventive one. In fact, in nearly all of the earliest uses of quotative “be like” (from the 1970s and early 1980s, as reported in the Google Books Corpus), the intention is to denote a feeling or attitude, not a direct quote of a speech act. Such eventive predications don’t become common until the 1990s and 2000s.
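Anyone wanting to hunt for early examples in their own plain-text corpus could start from a pattern like the one below. This is a rough sketch, not the search I ran against the Google Books Corpus: the regular expression and the `find_quotatives` helper are assumptions, and because the pattern relies on an opening quotation mark it will miss unpunctuated transcripts of speech.

```python
import re

# Candidate pattern for quotative "be like": a subject word, a form of be
# (full or contracted), an optional adverb, "like", then an opening quote.
QUOTATIVE = re.compile(
    r"\b(?P<subject>\w+)"            # subject word, e.g. Monica, it, she
    r"(?:\s+(?:am|is|are|was|were)"  # full form of be ...
    r"|['’](?:m|s|re))"              # ... or a contracted form
    r"\s+(?:\w+\s+)?"                # optional adverb, e.g. totally
    r'like,?\s*["“]',                # like, then a straight or curly quote
    re.IGNORECASE,
)

def find_quotatives(text):
    """Return the subject word of every quotative 'be like' candidate."""
    return [m.group("subject") for m in QUOTATIVE.finditer(text)]

# Example sentences taken from this post.
examples = [
    "Monica was like, “Oh my God!”",
    "Ross was totally like, “I don’t care!”",
    "But to me it was like, “Oh, why can’t you say, ‘Gee that’s wonderful . . .’”",
    "Monica said, “I should go to the mall.”",
]
subjects = [find_quotatives(sentence) for sentence in examples]
```

Each hit still needs to be read by hand, since the pattern cannot tell a stative reading from an eventive one; it only narrows the search.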

“Like wow,” then, arose in 1950s slang as a stative description. However, the sentence in (14) below suggests that wow was not interpreted as a structurally independent interjection but as an adjective. This is from a 1960 edition of Road and Track magazine:

(14) Man, that crate would look like wow with a Merc grille.


It is possible that like is an adverb here, but in my estimation it is most likely still a garden variety manner preposition that has innovatively selected for a bare adjective. Typically, like as a preposition only selects NPs as its complement. However, with the advent of “like wow,” it loosened its selection requirements and began to select for adjectives as well. And not just adjectives. The bottom line in this advertisement from Billboard magazine in May 1960 demonstrates that it also began to select for adverbs:


Apparently, in the 1950s and early 1960s, like became a popular and versatile manner preposition. Once like loosened its requirements to select AP complements, it’s easy to see how it could start selecting quotes, thus becoming a new direct quote marker (like narrative “go”); and given the stative denotation of the original phrase “like wow,” it’s also easy to see why stative to be would become the verbal element in this quotative rather than a lexical verb like act or go. Indeed, it appears that the first uses of quotative “be like” were entirely restricted to the phrase “like wow,” ensuring that subsequent uses would likewise have stative readings. (The ad above also shows how easy it would be for like to become an all-around discourse marker once it began to select for a wider range of phrases.)

So, based on the timeline of evidence in the corpus, I posit the following evolution:


The emergence of quotative “like”

I follow Haddican et al. in assuming that like in quotative “be like” is still a manner preposition. However, while they assume the preposition did not undergo any change, I argue that like became more versatile in its selection restrictions. This versatility allowed it first to select APs, then to select quotes. Initially, this quotative construction was just an extension of the phrase “like wow,” but it soon began to select any quoted material. And from the beginning, this quotative possessed two features: a) it had an obvious mimetic implicature, ensuring that it would be a direct quote marker, similar to narrative “go”; and b) it had a stative denotation, due to the stative denotation of the original phrase “like wow,” ensuring that the directly marked quotes were reflective of internal speech, i.e., thoughts or attitudes.

A corpus analysis by Buchstaller (2001) has shown that, even today, quotative “go” is much more likely than quotative “be like” to frame “real, occurring speech” (p. 10); in other words, “be like” continues to be used more often as a stative rather than eventive predicate. As I mentioned earlier, Haddican et al. are correct that one innovative aspect of quotative “be like” is that quotes are now able to be descriptors of states; however, I believe they overstate the eventive vs. stative ambiguity that arises in these quotatives. Most of the time, in real contexts, they are as unambiguously stative as they are unambiguously mimetic of the state. Haddican et al. themselves note that even these eventive readings are open to clarification. Asking whether or not someone “literally” said something sounds much odder following a say-type quotative than a “be like” quotative with a putatively eventive reading.


Nevertheless, as I showed at the very beginning of this post, there are instances where quotative “be like” seems to denote an eventive speech act. Linguistically, this is odder than it sounds at first. A single verbal construction—like quotative “be like”—should not have a stative and eventive reading. This ambiguity can only happen for two reasons: either there is some special semantic function at work in this construction, or there are in fact two separate quotative constructions, each with its own syntactic structures.

It is tempting to see a correlation between this ambiguity and the putative ambiguity between stative be and eventive be, also known as the be of activity. Consider the following sentences:

(15) Joey was silly.

(16) Rachel asked Joey to be silly.

Both forms of be select an adjective; however, (16), unlike (15), can be taken to mean that Joey performed some silly action. In other words, the small clause in (16) seems to be an eventive predication, not a stative one. It has been argued (Parsons 1990) that this eventive be is not the usual copular form but a completely different verb that means something like “to act”—in other words, English to be is actually a homophonous pair of verbs, similar to auxiliary have and possessive have. Perhaps this lexical ambiguity in be is related to the eventive vs. stative ambiguity in quotative “be like.” The stative reading arises when stative be is involved; the eventive reading arises when the eventive, lexical be is involved.

Haddican et al. argue against this line of thought. Diachronically, we know that quotative “be like” has arisen rapidly in many varieties of English, and that in all of these varieties, the semantics are ambiguous. But if there are in fact two be verbs that underwent this quotative innovation, then we would need to posit two unrelated channels of change: one in which like+QUOTE became a possible complement of stative be and one in which like+QUOTE became a possible complement of eventive be.

This is actually a problematic claim, given that, presumably, stative and eventive be have different structures. The former undergoes its typical V to T movement in English; the latter, given its eventive semantics, would be expected to remain in the VP like any other lexical verb. These underlying structures would demand that we devise different processes by which quotative “be like” arose. However, given the rapidity with which it did in fact arise, it is more probable that it arose via a single process, and the inevitable conclusion is that there is a single, stative verb to be that underwent the process. This conclusion is also verified by the auxiliary-like behavior of be in quotatives involving adverbs and questions:

(17) Ross was totally like, “I don’t care!”

(18) Was Ross like, “I don’t care”?

Although the ambiguous stative vs. eventive reading still occurs here, (17) exhibits raising above AdvP, and (18) exhibits subject-aux inversion. In other words, be in these quotatives behaves like an ordinary copular auxiliary, not a lexical verb. We therefore should not posit a separate, eventive be verb. We need another way to explain the semantic ambiguity of these quotatives.

Haddican et al. explain this ambiguity with Davidsonian semantics. Briefly stated, they argue that there is a single stative be verb, both in these quotative constructions and in English more generally. However, be has a semantic LOCALE function that, in certain contexts, can localize the state in a short-term event, and this localization of an event can force an agentive role onto the subject, even when an adjective has been selected by be. So, in a sentence such as (19), be will have a denotation as in (20):

(19) Joey is being silly.

(20) [[be]] = λSλeλx. ∃s ∈ S [e = LOCALE(s) & ARGUMENT(x, e)]

(20) takes a property of state S and localizes it into an event (a moment in which Joey was silly); in the right context, it is not a great leap to coerce this experiencer event into an agentive one. The application of these semantics to “(be) like” quotatives is straightforward:

In the state reading, be like is simply a stage level use of the copula, localised to the event in which the subject of be exhibited the relevant behaviour. The eventive reading arises when the event mapped to is an agentive one, where the most plausible event of an agent behaving in a quotative manner is the relevant speech act. (Haddican et al. 2012, p. 85)

In short, the ambiguity between stative and eventive “be like” arises from a semantic property that forces certain “states of being” to be processed as localized events whereby the experiencer of the event takes on an agentive role. In certain quotative contexts, the embedded quote is processed as an event, and the subject is understood as having caused that event, i.e., as actually saying something rather than just experiencing an attitude.

I agree that it would be better not to posit two homophonous verbs (stative be vs. be of activity) to account for the ambiguous stative vs. eventive denotations of quotative “be like.” Doing so requires two separate analyses and two separate channels of diffusion, which seems unlikely given the rapidity with which this quotative did in fact spread across many varieties of English. However, Haddican et al.’s application of Davidsonian semantics to explain the ambiguous readings runs into a problem in sentences like (21) below, as well as in the earlier example in (13):

(21) It was like, “Oh Mom, Can I film a movie in the house, it won’t be any problem at all.”

This is clearly an eventive predication of quotative “be like.” But instead of an agentive subject we have expletive it. Recall that Haddican et al.’s analysis relies on the notion that stative be has a LOCALE function that locates the state into a temporary moment or event. This localization can coerce an experiencer subject into the role of an agentive subject when the most likely reading (as above) suggests that the temporary event was an actual speech act. As Haddican et al. say themselves, “this event assigns an agentive role to the subject” (p. 85). However, by definition, the expletive in (21) receives no theta role and can therefore be neither the experiencer of a state nor the agent of an event. And yet (21) clearly denotes an eventive reading: the speaker actually spoke the words, or something like them.

The fact that “be like” quotatives can take an eventive (or even a stative) reading when an expletive surfaces in spec-TP suggests that Davidsonian semantics do not explain the ambiguous eventive vs. stative readings associated with these quotatives. (The fact that “be like” quotatives exhibit both experiencer subjects and expletive subjects also suggests that the quote CP is the only obligatory argument assigned by “be like.”)

The only alternative seems to be that there are in fact two homophonous be verbs, and quotative “be like” makes use of both. Maybe this isn’t such a big deal. If I’m right about the diachronic process by which quotative “be like” arose, then we can at least see a two-step process: quotative “be like” was solely a stative predicate in its early use and for most of its early history; only later did it begin to be used as an eventive predicate. And if there are in fact two be verbs, the eventive sounds exactly like the stative and is in fact much rarer than the stative, so I suppose one can see how these facts laid the groundwork for the eventual use of stative “be like” as an eventive predicate.

Historical Linguistics and Population Genetics

Reich et al. provide a model of two ancient populations in India that are ancestral to modern populations: Ancestral North Indians (ANI) and Ancestral South Indians (ASI). According to Reich et al., ANI is, on average, more genetically similar to Middle Easterners, Central Asians, and Europeans. ASI, on the other hand, is distinct from ANI as well as from East Asian populations. This same study found that “ANI ancestry ranges from 39–71% in most Indian groups, and is higher in traditionally upper caste and Indo-European speakers.” Furthermore, Reich et al. showed that the Indian caste system is old and historically implacable: high FST values indicate that “strong endogamy must have shaped marriage patterns in India for thousands of years.” This seriously contradicts the claims of Edward Said, Nicholas Dirks, and others who have argued that caste in India was more fluid and less systematized before British imperial rule.

However, a recent paper (Moorjani et al. 2013) does show fluid population admixture between Indian groups somewhere between 1,900 and 4,200 years ago.

Our analysis documents major mixture between populations in India that occurred 1,900 – 4,200 years BP, well after the establishment of agriculture in the subcontinent. We have further shown that groups with unmixed ANI and ASI ancestry were plausibly living in India until this time. This contrasts with the situation today in which all groups in mainland India are admixed. These results are striking in light of the endogamy that has characterized many groups in India since the time of admixture. For example, genetic analysis suggests that the Vysya from Andhra Pradesh have experienced negligible gene flow from neighboring groups in India for an estimated 3,000 years. Thus, India experienced a demographic transformation during this time, shifting from a region where major mixture between groups was common and affected even isolated tribes such as the Palliyar and Bhil to a region in which mixture was rare.

As the researchers go on to indicate, ~2,000 to 3,000 years ago corresponds to the major transitions attendant to the end of the Harappan civilization and the influx of the Indo-Aryans. Can these genetic studies shed any light on the controversies of Indian language history?

Emeneau’s famous 1956 paper, “India as a Linguistic Area,” holds up reasonably well to contemporary scrutiny. The Indo-Aryan, Dravidian, and Munda language families have obviously influenced one another. Dravidian influence on Indo-Aryan is well attested. But this seems odd given the correlation, discovered by Reich et al. and others, between Indo-European speaking ancestry and upper caste status in India. Another population genetics study (Bamshad et al. 2001) puts it this way:

Indo-European-speaking people from West Eurasia entered India from the Northwest and diffused throughout the subcontinent. They purportedly admixed with or displaced indigenous Dravidic-speaking populations. Subsequently they may have established the Hindu caste system and placed themselves primarily in castes of higher rank.

These “Indo-European-speaking people” probably have something to do with Reich et al.’s Ancestral North Indians. But if these “invaders” were strong enough to admix with and displace the indigenous Dravidic-speaking populations, why does Emeneau find Dravidian influence on Indo-Aryan? Imagine Cherokee influencing English on the scale of 5%. It’s just not going to happen. Most linguistic history shows that dominant languages influence less dominant languages; the opposite rarely occurs, and if it does, its influence on the dominant language is minimal. In another paper, Emeneau has this to say:

[There has long been the assumption] that the Sanskrit-speaking invaders of Northwest India were people of a high, or better, a virile, culture, who found in India only culturally feeble barbarians, and that consequently the borrowings that patently took place from Sanskrit and later Indo-Aryan languages into Dravidian were necessarily the only borrowings that could have occurred . . . It was but natural to operate with the hidden, but anachronistic, assumption that the earliest speakers of Indo-European languages were like the classical Greeks or Romans—prosperous, urbanized bearers of a high civilization destined in its later phases to conquer all Europe and then a great part of the earth—rather than to recognize them for what they doubtless were–nomadic, barbarous looters and cattle-reivers whose fate it was through the centuries to disrupt older civilizations but to be civilized by them.

Rather than the image of Indo-European “invaders” whose civilized power subjugated indigenous Indian populations, Emeneau instead imagines barbarians at the gates. Certainly, the language of nomads would be more socially susceptible to indigenous Dravidian, but how does this picture fit with the recent discovery of early population admixture? Would indigenous Dravidians have been more likely to breed freely with uncivilized nomads roaming and slowly penetrating the borderlands? Possibly.

Michael Witzel might have a different solution. The oldest Indian text following the actual Harappan script itself is the Rigveda, a collection of sacred Vedic Sanskrit hymns. Witzel finds in the earliest sections of the Rigveda several hundred lexical items and a few morphological features that are clearly not of Sanskrit (and therefore, not of Indo-European) origin. His analysis of these features leads him to believe that the language spoken before the arrival of Indo-Europeans—i.e., spoken in the Harappan civilization—was more closely related to the Munda languages and the Austroasiatic language family. In other words, Witzel’s analysis suggests that an Indo-European “invasion” and domination of indigenous Dravidian speakers is probably not an accurate historical picture. A sacred Indo-European text like the Rigveda would not contain so many non-IE loanwords if its speakers had entered the scene as dominant bringers of hierarchy. And given that the non-IE loanwords and morphological features are more likely Austroasiatic than Dravidian, Witzel envisions a time when Indo-European speakers and Dravidian speakers immigrated slowly into Harappan civilization, neither dominant invaders nor barbarous raiders. This would explain the cross-linguistic influence in the Indian subcontinent. It would also explain Moorjani et al.’s recent paper showing major mixture between groups in India prior to the rise of the caste system several thousand years ago.

Or maybe not. Witzel’s theory is not well accepted among historical linguists. And if Indo-Aryan and Dravidian immigration was so gradual and perhaps even egalitarian (Witzel imagines that Harappan urban centers may have been trilingual), from whence came a caste system that so clearly favors one ancestral group over the others? And there’s a nagging question about timing: one study suggests that Reich’s ANI might not fit within the purported timeline of Indo-European speakers’ migration. There’s also the issue of linguistic distribution. Razib Khan notes:

It seems an almost default position by many that the Austro-Asiatics are the most ancient South Asians, marginalized by Dravidians, and later Indo-Europeans. I would not be surprised if it was actually first Dravidians, then Austro-Asiatics and finally Indo-Europeans. Dravidians are found in every corner of the subcontinent (Brahui in Pakistan, a few groups in Bengal, and scattered through the center) while the Austro-Asiatics exhibit a more restricted northeastern range.

It’s all quite messy, but my point is that linguists interested in language contact and linguistic evolution should be reading work in population genetics, too. Papers on population genetics often reference work in historical linguistics; however, I rarely see historical linguists citing population genetics.

Grammatical Anaphors without C-command

More on Chomsky’s Binding Theory. It’s a good example of how generative rules are constantly formulated and re-formulated in light of new evidence—languages are infinite, there’s always new evidence—a seemingly endless process that to my mind undermines the entire concept of Universal Grammar (though not the fact of linguistic structure).

To undermine Binding Theory in particular, here’s a piece of evidence that complicates Binding Principle A. Of course, many linguists have presented reams of evidence to complicate Principle A as traditionally construed, but I’ve never seen this particular data-point, which, I think, complicates not only Principle A but also the centrality of c-command to anaphor distribution, which is what Principle A is supposed to account for.

Principle A states that one copy of a reflexive in a chain must be bound within the smallest CP or DP containing it and a potential antecedent. A reflexive is bound if it is co-indexed with and c-commanded by its antecedent Determiner Phrase (DP). Co-indexation simply means that both DPs refer to the same entity (e.g., John and himself). C-command is a structural relation. In a syntax tree, a node c-commands its sister node and all the nodes dominated by its sister. In practical terms, a phrase in English will usually but not always c-command all the other words and phrases to the right of it (i.e., all the words spoken after the phrase).


According to Principle A, a reflexive pronoun (also called an anaphor in generative linguistics) must be bound in its domain. It must be co-indexed with and c-commanded by another DP:

In the sentence The girl loves herself, the anaphor is co-indexed with and c-commanded by its antecedent DP. Thus, the sentence is grammatical. The anaphor cannot refer to anyone but the girl. If you wanted the object to refer to someone or something other than the girl (that is, if you gave it a different index), then you would need to change the anaphor to a pronoun, it or her, to make the sentence grammatical: The girl loves it.

The sentence *Herself loves the girl is ungrammatical, according to Principle A, because herself c-commands the girl. But it’s supposed to be the other way around: the anaphor needs to be c-commanded. It’s not, so the sentence doesn’t work.

The notion of c-command is a vital component of nearly all theories of pronoun and anaphor distribution, even the ones that have completely overhauled Chomsky’s original Binding Principles. But look at the grammatical examples in (1) and (2) below:

(1) There was a man in an attic searching through an old photo album. Surprisingly, the man’s search turned up images of himself and not his son, like he had expected.

(2) The photographer thought his lab was developing pictures of his girlfriend. Surprisingly, the photographer’s lab developed pictures of both his girlfriend and himself.

The man’s search and The photographer’s lab are possessor DPs. They have the following structure:


With possessor DPs, the possessor is actually a second DP embedded within the DP that expresses the possessor-possessee relationship. In other words, the photographer is embedded lower in the tree than the photographer’s lab. I said a moment ago that a phrase in English will usually but not always c-command all the words and phrases to the right of it. The two examples above fall under “but not always”:


In (2), the photographer only c-commands lab; it is embedded too deep to c-command anything else. In (1), the man c-commands search; it is embedded too deep to c-command anything else. Neither DP c-commands into the Verb Phrase, which means that neither DP c-commands the anaphor embedded within the Verb Phrase. The anaphors in (1) and (2) are not c-commanded and thus not bound. This should trigger a Principle A violation, but according to my judgment and the judgment of several informants, (1) and (2) sound just fine.
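The c-command claims above can be checked mechanically. What follows is a minimal sketch, not the post's original tree diagrams: the `Node` class, the helper names, and the simplified tree shapes (a possessor DP sitting beside a D' inside the subject DP) are all assumptions made for illustration.

```python
class Node:
    """A node in a syntax tree; leaves carry a word, phrases carry children."""

    def __init__(self, label, children=(), word=None):
        self.label = label
        self.word = word
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def dominates(a, b):
    """Return True if node a properly dominates node b."""
    node = b.parent
    while node is not None:
        if node is a:
            return True
        node = node.parent
    return False


def c_commands(a, b):
    """a c-commands b if b is a's sister or is dominated by a's sister."""
    if a.parent is None:
        return False
    sisters = [n for n in a.parent.children if n is not a]
    return any(s is b or dominates(s, b) for s in sisters)


# "The girl loves herself": [TP [DP the girl] [VP loves [DP herself]]]
girl = Node("DP", word="the girl")
herself = Node("DP", word="herself")
sentence1 = Node("TP", [girl, Node("VP", [Node("V", word="loves"), herself])])

# "The photographer's lab developed pictures of ... himself" (simplified):
# the possessor DP is embedded inside the subject DP, beside D' ('s + lab).
photographer = Node("DP", word="the photographer")
lab = Node("NP", word="lab")
subject = Node("DP", [photographer, Node("D'", [Node("D", word="'s"), lab])])
himself = Node("DP", word="himself")
obj = Node("DP", [Node("N", word="pictures"),
                  Node("PP", [Node("P", word="of"), himself])])
sentence2 = Node("TP", [subject, Node("VP", [Node("V", word="developed"), obj])])

print(c_commands(girl, herself))          # True: ordinary binding
print(c_commands(photographer, lab))      # True: possessor reaches the possessee
print(c_commands(photographer, himself))  # False: possessor is embedded too deep
print(c_commands(subject, himself))       # True: but the whole DP refers to the lab
```

Under these assumed structures, the only DP that c-commands the anaphor in (2) is the full subject (the lab), which is not what himself refers to.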

If anaphor distribution truly relied on c-command, then (1) and (2) above should sound just as awful as *Herself loves the girl.

I said at the beginning that Chomsky’s Binding Theory has been called into question for many years now, but as far as I know, most attempts to re-theorize it continue to rely on c-command as an important structural element for describing constraints on anaphor distribution. However, the data presented here demonstrate that anaphors can still sound grammatical even when they are not c-commanded. This indicates that discursive contexts can override the constraints of c-command on anaphor distribution.

Binding Reflexives and Herding Cats

Chomsky’s insight is that language possesses structure independent of meaning. Take the examples below:

(1a) There seems to be a girl in the garden

(1b) ??There seems to be Kate in the garden

(1c) ??There seems to be the boy in the garden

(1d) *There seems to be him in the garden

The only difference between these sentences is the noun in the garden—a girl, Kate, the boy, and him. So why does (1a) sound perfectly fine while the others sound off? Why does (1d) sound thoroughly ungrammatical? There must be structural elements involved here that are not visible in the words themselves.

Another, famous example:

(2a) Colorless green ideas sleep furiously

(2b) *Colorless ideas green furiously sleep

(2c) *Colorless green ideas sleeps furiously

Each sentence is meaningless. Yet most English speakers will agree that (2a) is fine while (2b) is word salad, and that in (2c), there’s something wrong with the verb. Again, the only reason why a meaningless sentence can still sound wrong or right is that the structure of language is at least partially independent of its meaning. From this hypothesis follows the concept of universal grammar—all human groups exhibit language, and if languages exhibit structure independent of meaning, then at a deep level, all human languages, beneath their superficial diversity, might operate upon the same structures. The goal of “Chomskyan” or “formalist” linguistic analysis is to describe the structure of this universal grammar (UG).

An adequate structural model of a language (and, eventually, of all languages) will consist of rules that can generate the grammatical sentences in the language while at the same time barring ungrammatical sentences from being generated. For the last several decades, the work of linguistics in North America and much of Europe has centered around discovering and describing these generative rules. The problem is that when one scholar has got a rule just right (it correctly predicts which sentences will be grammatical and which ones will be filtered out as ungrammatical), some other scholar pops up with new data showing a grammatical or ungrammatical sentence that shouldn’t exist according to the rule. And so the rule gets re-worked, made more complex, or abandoned in favor of some other rule . . . which awaits its destruction at the hands of some bizarre sentence that should or should not be grammatical.

It’s obvious that languages have structure. What’s not so obvious is that linguistic structure can be described with a closed system of rules. In the humble opinion of this blogger, trying to model UG is like trying to herd cats. Maybe you can herd most of them, but there’s always a few that just hiss and run away, and their existence seems to undermine the premise of the whole endeavor.

Take reflexive pronouns, for example. If any linguistic element can be described with robustly predictive rules, it should be reflexives. By definition, reflexives are structural: they must refer to (i.e., be co-indexed with) some other noun phrase (NP) in a sentence; otherwise, they sound ungrammatical, as in *Himself went to the store.

It has long been noted that reflexive pronouns in English and many other languages appear in complementary distribution with personal pronouns, which don’t need to co-refer with another noun phrase in a sentence:

(3a) Michael loves himself

(3b) Michael loves him

In (3a), himself can only refer to Michael. In (3b), him cannot refer to Michael; it must refer to some NP other than Michael, an NP which needn’t exist in the same sentence. If you want him to refer to Michael, you don’t use him, you use reflexive himself.

This distribution of reflexives and personal pronouns is the basis of Chomsky’s Binding Theory, specifically Binding Principles A and B, which state, respectively, that reflexives must be c-commanded by their co-indexed NP within some local domain and that pronouns cannot be c-commanded by their co-indexed NP within some local domain. Defining “domain” is tricky. Once upon a time, it appeared that the domain was the clause:

(4) Michael said that he loves Mary

In (4), the pronoun he is indeed c-commanded by its co-indexed NP, Michael, but the sentence is still grammatical. Apparently, Binding Principle B only applies intra-clausally. The “domain” for the binding principles must therefore be the clause.

Binding Principle A: A reflexive pronoun must be c-commanded by its co-indexed NP within the clause that immediately contains both the reflexive and its antecedent.

Binding Principle B: A pronoun must not be c-commanded by a co-indexed NP within the same clause.

An NP that is both c-commanded by and co-indexed with another NP is said to be “bound” by the second NP, its antecedent. Binding Principles A and B can be glossed in simpler terms by saying that a reflexive pronoun must be bound within its clause, and a personal pronoun must not be bound (or “must be free”) within its clause. As formulated, these rules correctly predict the grammaticality of many, many sentences cross-linguistically.

But not all of them:

(5) Michael loves his snake

The pronoun his is bound by Michael within the same clause. That’s a violation of Principle B. (5) should not be grammatical. But it’s grammatical. Something’s wrong with Principle B. And what about the example in (6):

(6) Mary thinks that the picture of herself looks beautiful

The reflexive herself is in a separate clause from its binding NP, Mary. That’s a violation of Principle A. (6) should not be grammatical. But it is. Something’s wrong with Principle A, too.

Chomsky and others tried to tighten up the binding rules to account for these sentences by changing the definition of “domain.” I won’t go into all the details, but at the moment, standard linguistics textbooks describe the binding rules in the following way (these definitions come from Carnie):

Binding Principle A: One copy of a reflexive in a chain must be bound within the smallest CP or DP containing it and a potential antecedent.

Binding Principle B: A pronoun must be free (not bound) within the smallest CP or DP containing it but not containing a potential antecedent. If no such category is found, the pronoun must be free within the root CP.

Clearly, the only way to salvage the entire premise of the binding principles is to make them quite a bit more complicated. That’s not necessarily a mark against them. No one said linguistic structure would be simple or elegant.

However, these new and improved binding rules continue to rely on the notion that reflexives and pronouns will be bound or not bound within their domains. They also continue to predict that reflexives and pronouns will be in complementary distribution. Now consider the following examples:


(7) Grand ideas about himself occupy John all day

(8a) John boasted that the Queen had invited Lucie and himself for tea

(8b) John boasted that the Queen had invited Lucie and him for tea

(8a) and (8b) demonstrate that pronouns and reflexives, in this case, are not in complementary distribution. (7) provides an example of a reflexive that is not bound by its co-indexed NP—himself occurs before John. It looks like even our new and improved (and more complex) binding rules fail to predict which sentences will or will not be grammatical. These examples could easily be multiplied. And we haven’t even left English!

Of course, linguists continue to re-formulate binding rules that take the above examples into consideration. But in order to herd these cats, things get very complicated very quickly, and many of the papers formulating new binding rules (e.g., Reinhart and Reuland 1993) contain a lot of sentences that begin with “Suppose that . . .” The suppositions may indeed be correct, and, as I said, there was never a guarantee that the rules of UG would be simple. However, for the past 40 years, North American linguistics has been a constant complication of older rules with newer rules as more data (especially cross-linguistic data) comes to the field’s attention. This process of formulation and re-formulation in light of new data, which I have simplistically illustrated here with Binding Principles, is exactly what linguists do. This process may indeed be expanding our knowledge about the structures of languages and UG. I think it has provided a lot of insight into linguistic structures. But it seems like there can never be closure. There will always be another piece of data to demonstrate that a rule is incomplete or simply incorrect. And unfortunately, the unlikelihood that the rules being amassed will ever reach closure seems to undermine the entire process. One can’t help agreeing, if only momentarily, with John McWhorter’s warning that the search for the structures of Universal Grammar might look as silly to future scholars as the search for phlogiston looks to us today.

(Or it might not. I don’t know. In the end, the argument I made in the paragraph above is similar to the argument against trying to pin down polygenic traits in humans—it’s just too complicated. And that’s never a productive stance to take.)

Uploading a Corpus to the NLTK, part 2

A year ago I posted a brief tutorial explaining how to upload a corpus into the Natural Language Toolkit. The post receives frequent hits, and I’ve received a handful of emails asking for further explanation. (The NLTK discussion forums are not easy to navigate, and Natural Language Processing with Python is not too clear about this basic but vital step. It’s discussed partially and in passing in Chapter 3.)

Here’s a more detailed description of the process, as well as information about preparing the corpus for analysis:

1. The corpus first needs to be saved in a plain text format. Also, the plain text file needs to be saved in the main Python folder, not under Documents, so that IDLE can open it by filename alone. The path for your file should look something like this: c:\Python27\corpus.txt

Once the corpus is in the proper format, open the Python IDLE window, import the Natural Language Toolkit, then open the corpus and convert it into raw text using the following code:


Using the ‘type’ command, we can see that the uploaded corpus at this point is simply a string of raw text (‘str’) as it exists in the .txt file itself.
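A minimal sketch of this first step, written for Python 3. The sample sentence and the temporary file are assumptions that stand in for a real corpus.txt so the snippet runs on any machine; cccraw is the variable name used in this post. NLTK itself isn't needed until the tokenizing step.

```python
import os
import tempfile

# The post saves the corpus at c:\Python27\corpus.txt; a temporary file
# stands in for it here (an assumption) so the sketch runs anywhere.
corpus_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
with open(corpus_path, "w", encoding="utf-8") as f:
    f.write("Rhetoric is the art of persuasion. The rhetoric of science, too!")

# Open the corpus and read the whole file into one raw string
with open(corpus_path, encoding="utf-8") as f:
    cccraw = f.read()

print(type(cccraw))   # the raw corpus is a plain string ('str')
print(cccraw[:150])   # first 150 characters of the raw text
```

With a real corpus, only the second `with` block is needed, with corpus_path set to the location of your own file.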


2. The next step involves ‘tokenizing’ the raw text string. Tokenization turns the raw text into tokens: each word, number, symbol, and punctuation mark becomes its own entity in a long list. The line ccctokens[:100] shows the first 100 items in the now-tokenized corpus. Compare this list to the raw string above, listed after cccraw[:150].


Tokenization is an essential step. NLTK’s counting tools operate on lists of tokens; given a raw string, a frequency count would treat each character as a separate item rather than each word.
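A sketch of the tokenizing step. It uses NLTK's `wordpunct_tokenize`, which is my choice here rather than necessarily the call used in the original post: `nltk.word_tokenize` is the more common option, but it needs an extra data download. The fallback function is an assumption added so the sketch still runs without NLTK installed; it uses what is, to my knowledge, the same regular expression. The sample string stands in for the raw corpus.

```python
import re

try:
    # NLTK's regex-based tokenizer; needs no extra data downloads
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Fallback with the same pattern, so the sketch runs without NLTK
    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)

# Stand-in for the raw string read from corpus.txt (an assumption)
cccraw = "Rhetoric is the art of persuasion. The rhetoric of science, too!"

# Split the raw string into a list of word and punctuation tokens
ccctokens = wordpunct_tokenize(cccraw)

print(type(ccctokens))   # now a list rather than a string
print(ccctokens[:100])   # first 100 tokens (all of them, for this tiny sample)
```

Each word and each punctuation mark is now its own item in the list, so the period after persuasion is a token separate from the word.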

3. Next, all the words in the tokenized corpus need to be converted to lower-case. This step ensures that the NLTK does not count the same word as two different tokens simply due to orthography. Without lower-casing the text, the NLTK will, e.g., count ‘rhetoric’ and ‘Rhetoric’ as different items. Obviously, some research questions would want to take this difference into account, but otherwise, skipping this step might muddy your results.
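A short sketch of the lower-casing step. The token list is a made-up sample, and ccclower is a hypothetical variable name; the counts before and after show why the step matters.

```python
from collections import Counter

# Stand-in for the tokenized corpus from step 2 (an assumption)
ccctokens = ["Rhetoric", "is", "the", "art", "of", "persuasion", ".",
             "The", "rhetoric", "of", "science", ",", "too", "!"]

# Lower-case every token so that capitalization no longer splits the counts
ccclower = [token.lower() for token in ccctokens]

before = Counter(ccctokens)
after = Counter(ccclower)

print(before["Rhetoric"], before["rhetoric"])  # 1 1 (counted separately)
print(after["rhetoric"])                       # 2 (counted together)
```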


4. Finally, attach the functionality of the NLTK to your corpus with this line: nltk.Text(tokenname)

‘tokenname’ stands for whatever you named your list of tokens in the preceding steps. The variable names used in the examples above (ccc, ccctokens, cccraw) can obviously be changed to whatever you want, but it’s a good idea to keep track of them on paper so that you aren’t constantly scrolling up and down in the IDLE window.

Now the corpus is ready to be analyzed with all the power of the Natural Language Toolkit.

Building a Chinese Room

Chomsky isn’t a fan of statistical machine learning. However, this video (via Steve Hsu) suggests that using Really Big Corpora is the best way to get machines to figure out how language works, both structurally and, as the video shows, phonetically and acoustically.

Around six minutes in, the demonstration begins. The speaker’s words are translated almost instantaneously into Chinese, and the auditory output sounds somewhat similar to the speaker’s actual voice. There are obviously Chinese speakers in the audience, and their response suggests that the demo was successful.

This video is a good example of the ways that computer scientists (and I include researchers in natural language processing in that category) are operating squarely in the realm of the humanities–what’s more humanistic than language translation? There have been tomes and manifestos written unto its spiritual, social, epistemological, and theoretical nature. And now computers are getting the hang of it. We humanists ignore their successes at our peril.

The Pareto distribution of Native American language speakers

My post about Native American language health gets the most hits on this blog, so I decided to do some minor editorial housekeeping on it last night. While I was fixing awkward syntax, however, I noticed something blatantly obvious about the first graph, which ranks living native languages by number of speakers:


It’s essentially a Pareto distribution, a long tail. I don’t know much about the mathematics underlying it. I only know that it arises naturally across an array of social, geographic, economic, and scientific phenomena. Derek Mueller recently wrote an article about this exact distribution amongst scholarly citations in the field of rhetoric and writing. “Conceptually,” he writes, “the long tail comes from statistics and graphing; it is a feature of a power law or Pareto distribution—graphed patterns that underscore the uneven distribution of some activity or phenomenon” (207). And yet this unequal distribution exists in phenomena as disparate as citations in academic journals and numbers of language speakers.

A power law writ deep in the mathematical fabric of things?

The language/genes metaphor (part 4)

Part IV: The basic building blocks of linguistic replication?

As I mentioned in the last post, I’m convinced that a language/phenotype analogy is more appropriate than a language/genes analogy. However, here’s a second piece of devil’s advocacy because I still believe a case can be made for the latter metaphor.

The language/genes metaphor is appropriate only if we assume that languages are the auditory manifestations of underlying linguistic structures, and that these structures replicate in the mind every time an individual learns a language, either as a first or second language. Entertaining this possibility means accepting Chomsky’s universal grammar hypothesis and his principles and parameters approach.

Does Chomsky’s approach provide some kind of mechanism whereby we can reduce linguistic structures to a few basic parts, the way we can reduce DNA and its replication to nucleotide bases and enzymes?

Yes, I think so. Phrase structure theory and X’ theory provide a framework for analyzing all human languages according to a few basic building blocks: phrases, phrase heads, complements, and specifiers. This is the “DNA” of language.


We don’t need to go into detail about this chart. Basically, all languages are built from phrase heads (X), which project to a phrase-bar (X’), which in turn projects to a phrase (XP). Phrase heads can optionally project a complement, and phrases can optionally project specifiers.

Four phrase heads map onto (more or less) the grammatical categories we learn in school: verbs, nouns, prepositions, adverbs and adjectives (both categorized as AP). The other phrase heads are less well known. Tense phrases (aka Inflection Phrases) are an abstract category that allows verbs to inflect for tense and person. Complementizer phrases are projected from the words that introduce embedded clauses: words such as that or because are complementizers. And determiner phrases project from what we commonly call articles: the, a, an in English.

All human languages are built from these categories. All human languages can be analyzed according to the same basic rules of phrase projection. The difference between languages is the difference between various “parameter settings” for these phrases and projections, among other aspects of language structure that I haven’t talked about here.

One of the most salient cross-linguistic differences is, of course, word order. However, according to phrase structure and X’ theory, this difference is simply a matter of re-configuring or re-arranging the structural projections:

The structure of English word order (Subject - Verb - Object)

The structure of Malagasy word order (Verb - Object - Subject)

The structure of Japanese word order (Subject - Object - Verb)

So, for example, the difference between English and Japanese is simply the difference between an X’ that projects to X first and complement second (English) and an X’ that projects to complement first and X second (Japanese). In other words, English is a head-initial language and Japanese is a head-final language.
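The head-direction parameter can be illustrated with a toy linearizer. This is only a sketch under loud assumptions: the `linearize` function and the nested (head, complement) tuples are hypothetical, the words are English glosses rather than real Japanese, and specifiers (including subjects) and case particles are ignored entirely.

```python
def linearize(phrase, head_initial):
    """Flatten a (head, complement) structure into a word order.

    A phrase is either a bare word or a (head, complement) pair, where the
    complement may itself be a phrase.
    """
    if isinstance(phrase, str):
        return [phrase]
    head, complement = phrase
    rest = linearize(complement, head_initial)
    # The single "parameter": does the head precede or follow its complement?
    return [head] + rest if head_initial else rest + [head]


# Hypothetical verb phrase: V "read" > N "books" > P "about" > N "syntax"
vp = ("read", ("books", ("about", "syntax")))

print(" ".join(linearize(vp, head_initial=True)))   # read books about syntax
print(" ".join(linearize(vp, head_initial=False)))  # syntax about books read
```

The same structure yields both orders; only the one boolean setting differs, which is the sense in which the two word orders are re-arrangements of the same projections.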


In a Chomskyan framework, there are basic building blocks of linguistic structures, and the differences between languages can be described as differences in the configuration and specification of these structures. Sounds roughly comparable to DNA in my opinion, but then, I’m not a geneticist . . .

(Note: the phrase table and the examples in this post are taken from the helpful notes of Dr. John Nissenbaum.)