An Attempt at Quantifying Changes to Genre Medium, cont’d.

Cosine similarity of all written/oral States of the Union is 0.55. A highly ambiguous result, but one that suggests there are likely some differences overlooked by Rule et al. (2015). A change in medium should affect genre features, if only at the margins. The most obvious change is to length, which I pointed out in the last post.

But how to discover lexical differences? One method is naïve Bayes classification. Although the method has been described for humanists in a dozen places at this point, I’ll throw my own description into the mix for posterity’s sake.

Naïve Bayes classification occurs in three steps. First, the researcher defines a number of features found in all texts within the corpus, typically a list of the most frequent words. Second, the researcher “shows” the classifier a limited number of texts from the corpus that are labeled according to text type (the training set). Finally, the researcher runs the classifier algorithm on a larger number of texts whose labels are hidden (the test set). Using feature information discovered in the training set, including information about the number of different text types, the classifier attempts to categorize the unknown texts. Another algorithm can then check the classifier’s accuracy rate and return a list of tokens—words, symbols, punctuation—that were most informative in helping the classifier categorize the unknown texts.

More intuitively, the method can be explained with the following example taken from Natural Language Processing with Python. Imagine we have a corpus containing sports texts, automotive texts, and murder mysteries. Figure 2 provides an abstract illustration of the procedure used by the naïve Bayes classifier to categorize the texts according to their features. Bird, Klein, and Loper explain:

[Figure 2: abstract illustration of the naïve Bayes procedure, from Natural Language Processing with Python]

In the training corpus, most documents are automotive, so the classifier starts out at a point closer to the “automotive” label. But it then considers the effect of each feature. In this example, the input document contains the word “dark,” which is a weak indicator for murder mysteries, but it also contains the word “football,” which is a strong indicator for sports documents. After every feature has made its contribution, the classifier checks which label it is closest to, and assigns that label to the input.

Each feature influences the classifier; therefore, the number and type of features utilized are important considerations when training a classifier.
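To make the quoted example concrete, here is a minimal sketch of a naïve Bayes classifier in plain Python. It is not NLTK's implementation. The toy documents, the three feature words, and the function names are all made up for illustration, and the smoothing (add-one) is my own assumption. The classifier starts from the label priors, where "automotive" leads, and then lets each feature pull the score toward or away from each label.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (words in the document, label).
TRAINING = [
    (["engine", "road"], "automotive"),
    (["engine", "tire"], "automotive"),
    (["engine", "dark", "road"], "automotive"),
    (["engine", "brake"], "automotive"),
    (["dark", "night"], "murder mystery"),
    (["dark", "knife"], "murder mystery"),
    (["football", "goal"], "sports"),
    (["football", "dark", "stadium"], "sports"),
]
FEATURES = ["dark", "football", "engine"]  # assumed feature list


def train(labeled_docs, features):
    """Count documents per label, and per label how many contain each feature."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    for words, label in labeled_docs:
        label_counts[label] += 1
        present = set(words)
        for word in features:
            if word in present:
                feature_counts[label][word] += 1
    return label_counts, feature_counts


def classify(words, label_counts, feature_counts, features):
    """Return the most probable label and the log score of every label."""
    total = sum(label_counts.values())
    present = set(words)
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # the prior: how common the label is
        for word in features:
            # Add-one smoothing keeps unseen features from zeroing a label.
            p = (feature_counts[label][word] + 1) / (n + 2)
            score += math.log(p if word in present else 1 - p)
        scores[label] = score
    return max(scores, key=scores.get), scores


label_counts, feature_counts = train(TRAINING, FEATURES)
label, scores = classify(["dark", "football"], label_counts, feature_counts, FEATURES)
print(label)
```

On this toy data the prior favors "automotive," "dark" nudges the score a little toward murder mysteries, and "football" pulls it firmly toward sports, which is the label the sketch assigns.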

Given the SotU corpus’s word count—approximately 2 million words—I decided to use as features the 2,000 most frequent words in the corpus (the top 10%). I ran NLTK’s classifier ten times, randomly shuffling the corpus each time, so the classifier could utilize a new training and test set on each run. The classifier’s average accuracy rate for the ten runs was 86.9%.
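For readers who want to see the shape of that procedure, here is a rough sketch in plain Python: pick the most frequent words as features, then shuffle, split, and score repeatedly. Everything in it is an assumption for illustration. The function names are hypothetical, the 25% test split is a guess (I don't specify one above), the documents are toy stand-ins, and the "classifier" is a majority-label placeholder rather than NLTK's naïve Bayes classifier.

```python
import random
from collections import Counter


def top_words(docs, n):
    """Return the n most frequent words across all documents."""
    counts = Counter(word for words, _ in docs for word in words)
    return [word for word, _ in counts.most_common(n)]


def document_features(words, vocabulary):
    """Map each feature word to True/False depending on its presence."""
    present = set(words)
    return {word: word in present for word in vocabulary}


def majority_baseline(training_set):
    """Placeholder trainer: always predicts the most common training label."""
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda features: majority


def mean_accuracy(docs, vocabulary, train_fn, runs=10, test_fraction=0.25, seed=0):
    """Shuffle, split, train, and score `runs` times; return the mean accuracy."""
    rng = random.Random(seed)
    featured = [(document_features(words, vocabulary), label) for words, label in docs]
    accuracies = []
    for _ in range(runs):
        shuffled = featured[:]
        rng.shuffle(shuffled)
        cut = max(1, int(len(shuffled) * test_fraction))
        test_set, training_set = shuffled[:cut], shuffled[cut:]
        predict = train_fn(training_set)
        correct = sum(predict(features) == label for features, label in test_set)
        accuracies.append(correct / len(test_set))
    return sum(accuracies) / runs


# Hypothetical toy corpus: (tokens, medium).
DOCS = [
    ("the congress has the authority".split(), "written"),
    ("the treasury reports the revenue".split(), "written"),
    ("the executive informs the congress".split(), "written"),
    ("the authority of the congress".split(), "written"),
    ("the revenue of the treasury".split(), "written"),
    ("the congress and the executive".split(), "written"),
    ("we can't wait and you know it".split(), "oral"),
    ("you and we share the union".split(), "oral"),
]

vocabulary = top_words(DOCS, 3)
accuracy = mean_accuracy(DOCS, vocabulary, majority_baseline)
print(vocabulary, round(accuracy, 3))
```

Swapping `majority_baseline` for a real trainer is the only change needed to turn the harness into an actual experiment; a fixed seed keeps the ten shuffles reproducible.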

After each test run, the classifier returned a list of most informative features, the majority of which were content words, such as ‘authority’ or ‘terrorism’.

There is a problem, however: a direct comparison of these words is not optimal given my goals. I could point out, for example, that ‘authority’ is twice as likely to occur in written as in oral States of the Union; I could also point out that the root ‘terror’ is found almost exclusively in the oral corpus. Nevertheless, these results are unusable for analyzing the effects of media on content. For historical reasons, categorizing the SotU into oral and written addresses is synonymous with coding the texts by century. The vast majority of written addresses were delivered in the nineteenth century; the majority of oral speeches were delivered in the twentieth and twenty-first centuries. Analyzing lexical differences thus runs the risk of uncovering not variation between oral and written States of the Union (a function of media) but variation between nineteenth and twentieth century usage (a function of changing style preferences) or differences between political events in each century (a function of history). The word ‘authority’ has likely just gone out of style in political speechmaking; ‘terror’ is a function of twenty-first century foreign affairs. There is nothing about medium that influences the use or neglect of these terms. A lexical comparison of written and oral States of the Union must therefore be reduced to features least likely to have been influenced by historical exigency or shifting usage.

In the lists of informative features returned by the naïve Bayes classifier, pronouns and contraction emerged as two features fitting that requirement.

[Figure: Relative frequencies of first, second, and third person pronouns]

[Figure: Relative frequencies of select first and second person pronouns]

[Figure: Relative frequencies of apostrophes and negative contraction]

It turns out that pronoun usage is a noticeable locus of difference between written and oral States of the Union. The figures above show relative frequencies of first, second, and third person pronouns in the two text categories (the tallies in the first graph contain all pronominal inflections, including reflexives).

As discovered by the naïve Bayes classifier, first and second person pronouns are much more likely to be found in oral speeches than in written addresses. The second graph above displays particularly disparate pronouns: ‘we’, ‘us’, ‘you’, and to a lesser extent, ‘your’. Third person pronouns, however, surface equally in both delivery mediums.

The third graph shows relative frequency rates of apostrophes in general and negative contractions in particular in the two SotU categories. Contraction is another mark of the oral medium. In contrast, written States of the Union display very little contraction; indeed, the relative frequency of negative contraction in the written SotU corpus is functionally zero (only 3 instances). This stark contrast is not a function of changing usage. Negative contraction is attested as far back as the sixteenth century and was well accepted during the nineteenth century; contraction generally is also well attested in nineteenth century texts (see this post at Language Log). However, both today and in the nineteenth century, prescriptive standards dictate that contractions are to be avoided in formal writing, a norm which Sairio (2010) has traced to Swift and Addison in the early 1700s. Thus, if not the written medium directly, then the cultural standards for the written medium have motivated presidents to avoid contraction when working in that medium. Presidents ignore this arbitrary standard as soon as they find themselves speaking before the public.
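The measurements behind those graphs are simple ratios, and a small sketch may help. The snippet below is an illustration under my own assumptions, not the script I used: the tokenizer is a bare regular expression, the two sample sentences are invented, and a "negative contraction" is taken to be any token ending in n't.

```python
import re
from collections import Counter

# Letters, optionally followed by an apostrophe (straight or curly) and more letters.
TOKEN = re.compile(r"[a-z]+(?:['’][a-z]+)?")


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return TOKEN.findall(text.lower())


def relative_frequencies(text, targets):
    """Share of all tokens taken up by each target word."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return {word: counts[word] / len(tokens) for word in targets}


def negative_contraction_rate(text):
    """Share of all tokens that end in n't."""
    tokens = tokenize(text)
    hits = sum(1 for token in tokens if token.endswith(("n't", "n’t")))
    return hits / len(tokens)


# Invented sample sentences standing in for the two corpora.
oral = "We can't wait. You and I know we won't fail."
written = "The Congress will not fail to observe that the Executive has reported."

pronouns = ["we", "us", "you", "your"]
oral_pronouns = relative_frequencies(oral, pronouns)
written_pronouns = relative_frequencies(written, pronouns)
oral_contractions = negative_contraction_rate(oral)
written_contractions = negative_contraction_rate(written)
print(oral_pronouns, oral_contractions)
print(written_pronouns, written_contractions)
```

Dividing by the token count is what makes the two categories comparable despite their very different lengths.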

The conclusion to be drawn from these results should have been obvious from the beginning. The differences between oral and written States of the Union are pretty clearly a function of a president’s willingness or unwillingness to break the wall between himself and his audience. That wall is frequently broken in oral speeches to the public but rarely broken in written addresses to Congress.

As seen above, plural reference (‘we’, ‘us’) and direct audience address (‘you’, ‘your’) are favored rhetorical devices in oral States of the Union but less used in the written documents. The importance underlying this difference is that both features—plural reference and direct audience address—are deliberate disruptions of the ceremonial distance that exists between president and audience during a formal address. This disruption, in my view, can be observed most explicitly in the use of the pronouns ‘we’ and ‘us’. The oral medium motivates presidents to construct, with the use of these first person plurals, an intimate identification between themselves and their audience. Plurality, a professed American value, is encoded grammatically with the use of plural pronouns: president and audience are different and many but are referenced in oral speeches as one unit existing in the same subjective space. Also facilitating a decrease in ceremonial distance, as seen above, is the use of second person ‘you’ at much higher rates in oral than in written States of the Union. I would suggest that the oral medium motivates presidents to call direct attention to the audience and its role in shaping the state of the nation. In other cases, second person pronouns may represent an invitation to the audience to share in the president’s experiences.

Contraction is a secondary feature of the oral medium’s attempt at audience identification. If a president’s goal is to build identification with American citizens and to shorten the ceremonial distance between himself and them, then clearly, no president will adopt a formal diction that eschews contraction. Contraction—either negative or subject-verb—is the informality marker par excellence. Non-contraction, on the other hand, though it may sound “normal” in writing, sounds stilted and excessively proper in speech; the amusing effect of this style of diction can be witnessed in the film True Grit. In a nation comprised of working and middle class individuals, this excessively proper diction would work against the goals of shortening ceremonial distance and constructing identification. Many scholars have noted Ronald Reagan’s use of contraction to affect a “conversational” tone in his States of the Union, but contraction appears as an informality marker across multiple oral speeches in the SotU corpus. In contrast, when a president’s address takes the form of a written document, maintaining ceremonial distance seems to be the general tactic, as presidents follow correct written standards and avoid contractions. The president does not go out of his way to construct identification with his audience (Congress) through informal diction. Instead, the goal of the written medium is to report the details of the state of the nation in a professional, distant manner.

What I think these results indicate is that the State of the Union’s primary audience changes from medium to medium. This fact is signaled even by the salutations in the SotU corpus. The majority of oral addresses delivered via radio or television are explicitly addressed to ‘fellow citizens’ or some other term denoting the American public. In written addresses to Congress, however, the salutation is almost always limited to members of the House and the Senate.

Two lexical effects of this shift in audience are pronoun choice and the use or avoidance of contraction. ‘We’, ‘us’, ‘you’—the frequency of these pronouns drops by fifty percent or more when presidents move from the oral to the written medium, from an address to the public to an address to Congress. The same can be said for contraction. Presidents, it seems, feel less need to construct identification through these informality markers, through plural and second person reference, when their audience is Congress alone. In contrast, audience identification becomes an exigent goal when the citizenry takes part in the State of the Union address.

To put the argument another way, the SotU’s change in medium has historically occurred alongside a change in genre participants. These intimately linked changes motivate different rhetorical choices. Does a president choose or not choose to construct a plural identification between himself and his audience (‘we’, ‘us’) or to call attention to the audience’s role (‘you’) in shaping the state of the nation? Does a president choose or not choose to use obvious informality markers (i.e., contraction)? The answer depends on medium and on the participants reached via that medium: Congress or the American people.

~~~

Tomorrow, I’ll post results from two 30-run topic models of the written/oral SotU corpora.

An Attempt at Quantifying Changes to Genre Medium

Rule et al.’s (2015) article on the State of the Union makes the rather bold claim (for literary and rhetorical scholars) that changes to the SotU’s medium of delivery have had no effect on the form of the address, measured as co-occurring word clusters as well as cosine similarity across diachronic document pairs. I’ve just finished an article muddying their results a bit, so here’s the initial data dump. I’ll do it in a series of posts. Full argument to follow, if I can muster enough energy in the coming days to convert an overly complicated argument into a few paragraphs.

First, cosine similarity. Essentially, Rule et al. calculate the cosine similarity between each set of two SotU addresses chronologically (1790 and 1791, 1790 and 1792, 1790 and 1793, and so on) until each address has been compared to all other addresses. They discover high similarity measurements (nearer to 1) across most of the document space prior to 1917 and lower similarity measurements (nearer to 0) afterward, which they interpret as a shift between premodern and modern eras of political discourse. They visualize these measurements in the “transition matrices,” which look like heat maps, in Figure 2 of their article.
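The all-pairs comparison can be sketched in a few lines of Python. This is only an illustration of the pairing logic, not Rule et al.'s code: it uses raw word counts where they use weighted scores, the function names are hypothetical, and the three tiny "addresses" are invented.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(counts_a, counts_b):
    """Cosine similarity between two word-count vectors stored as Counters."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarity(addresses):
    """Compare every address with every other one, in chronological order."""
    vectors = {year: Counter(tokens) for year, tokens in addresses.items()}
    return {
        (first, second): cosine(vectors[first], vectors[second])
        for first, second in combinations(sorted(vectors), 2)
    }


# Invented stand-ins for cleaned, stemmed addresses keyed by year.
addresses = {
    1790: "union commerce treaty commerce".split(),
    1791: "union commerce treaty".split(),
    1792: "union war".split(),
}

similarities = pairwise_similarity(addresses)
for pair, value in similarities.items():
    print(pair, round(value, 3))
```

Each (year, year) key corresponds to one cell of a matrix like the ones Rule et al. shade as a heat map.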

Adapting a Python script written by Dennis Muhlestein, I calculated the cosine similarity of States of the Union delivered in both oral and written form in the same year. This occurred in 8 years, a total of 16 texts. FDR in 1945, Eisenhower in 1956, and Nixon in 1973 delivered written messages to Congress as well as public radio addresses summarizing the written messages. Nixon in 1972 and 1974, and Carter in 1978-1980 delivered both written messages and televised speeches. These 8 textual pairs provide a rare opportunity to analyze the same annual address delivered in two mediums, making them particularly appropriate objects of analysis. The texts were cleaned of stopwords and stemmed using the Porter stemming algorithm.

[Figure: Cosine similarity of oral/written SotU pairs]

The results are graphed above (not a lot of numbers, so there’s no point turning them into a color-shaded matrix, as Rule et al. do). The cosine similarity measurements range from 0.67 (a higher similarity) to 0.40 (a lower similarity). The cosine similarity measurement of all written and all oral SotU texts—copied chronologically into two master .txt files—is 0.55, remarkably close to the average of the 8 pairs measured independently.

There is much ambiguity in these measurements. On one hand, they can be interpreted to suggest that Rule et al. overlooked differences between oral and written States of the Union; the measurements invite a deeper analysis of the corpus. On the other hand, the measurements also tell us not to expect substantial variation.

In the article (to take a quick stab at summarizing my argument) I suggest that this metric, among others, reflects a genre whose stability is challenged but not undermined by changes to medium as well as parallel changes initiated by the medial alteration.

But you’re probably wondering what this cosine similarity business is all about.

Without going into too much detail, vector space models (that’s what this method is called) can be simplified with the following intuitive example.

Let’s say we want to compare the following texts:

Text 1: “Mary hates dogs and cats”
Text 2: “Mary loves birds and cows”

One way to quantify the similarity between the texts is to turn their words into matrices, with each row representing one of the texts and each column representing every word that appears in either of the texts. Typically when constructing a vector space model, stop words are removed and remaining words are stemmed, so the complete word list representing Texts 1 and 2 would look like this:

“1, Mary”, “2, hate”, “3, love”, “4, dog”, “5, cat”, “6, bird”, “7, cow”

Each text, however, contains only some of these words. We represent this fact in each text’s matrix. Each word—from the complete word list—that appears in a text is represented as a 1 in the matrix; each word that does not appear in a text is represented as a 0. (In most analyses, frequency scores are used, such as relative frequency or tf-idf.) Keeping things simple, however, the matrices for Texts 1 and 2 would look like this:

Text 1: [1 1 0 1 1 0 0]
Text 2: [1 0 1 0 0 1 1]

Now that we have two matrices, it is a straightforward mathematical operation to treat these matrices as vectors in Euclidean space and calculate the vectors’ cosine similarity with the Euclidean dot product formula, which, for non-negative vectors like these, returns a similarity metric between 0 and 1. (For more info, check out this great blog series; and here’s a handy cosine similarity calculator.)

[Figure: the cosine similarity formula, cos(θ) = (A · B) / (‖A‖ ‖B‖)]

The cosine similarity of the matrices of Text 1 and Text 2 is 0.25; we could say that the texts are 25% similar. This number makes intuitive sense. Because we’ve removed the stopword ‘and’ from both texts, each text is comprised of four words, with one word shared between them—

Text 1: “Mary hates dogs cats”
Text 2: “Mary loves birds cows”

—thus resulting in the 0.25 measurement. Obviously, when the texts being compared are thousands of words long, it becomes impossible to do the math intuitively, which is why vector space modeling is a valuable tool.
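For anyone who wants to check the arithmetic, here is a minimal Python sketch of the Mary example. The stemming and stopword removal are done by hand in the token lists, and the function names are my own; the word order follows the numbered word list above.

```python
import math

# Complete word list, in the order given above.
vocabulary = ["mary", "hate", "love", "dog", "cat", "bird", "cow"]

# Texts 1 and 2 after removing the stopword 'and' and stemming by hand.
text1 = ["mary", "hate", "dog", "cat"]
text2 = ["mary", "love", "bird", "cow"]


def binary_vector(tokens, vocabulary):
    """1 where the vocabulary word appears in the text, 0 where it does not."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


def cosine_similarity(a, b):
    """Dot product of the vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


vector1 = binary_vector(text1, vocabulary)
vector2 = binary_vector(text2, vocabulary)
similarity = cosine_similarity(vector1, vector2)
print(similarity)
```

The dot product is 1 (only 'Mary' is shared), and each vector has length 2 (the square root of its four 1s), so the result is 1 / (2 × 2) = 0.25.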

~~~

Next, length. Rule et al. use tf-idf scores and thus norm their algorithms to document length. As a result, their study fails to take into account differences in SotU length. However, the most obvious effect of medium on the State of the Union has been a change in raw word count: the average length of all written addresses is 11,057 words; the average length of all oral speeches is 4,818 words. Below, I visualize the trend diachronically. As a rule, written States of the Union are longer than oral States of the Union.

[Figure: State of the Union word counts, by year and medium]

The correlation between medium and length is most obvious in the early twentieth century. In 1913, Woodrow Wilson broke tradition and delivered an oral State of the Union; the corresponding drop in word count is immediate and obvious. However, the effect is not as immediate at other points in the SotU’s history. For example, although Wilson began the oral tradition in 1913, both Coolidge and Hoover returned to the written medium from 1924 to 1932; Wilson’s last two speeches in 1919 and 1920 were also delivered as written messages; nevertheless, these written addresses do not correspond with a sudden rebound in SotU length. None of the early twentieth century written addresses is terribly lengthy, with an average near 5,000 words.

The initial shift in 1801 from oral to written addresses also fails to correspond with an obvious and immediate change in word count. The original States of the Union were delivered orally, and these early documents are by far the shortest. However, when Thomas Jefferson began the written tradition in 1801, SotU length took several decades to increase to the written mean.

Despite these caveats, the trend remains strong: the oral medium demands a shorter State of the Union, while the written medium tends to produce lengthier documents. To date, the longest address remains Carter’s 1981 written message.

~~~

More later. Needless to say, I believe there are formal differences in the SotU corpus (~2 million words) that seem to correlate with medium. However, as I’ll show in a post tomorrow, they’re rather granular and were bound to be overlooked by Rule et al.’s broad-stroke approach.

Some questions about centrality measurements in text networks

[Animated figure: a text network alternating between betweenness centrality and degree centrality]

This .gif alternates between a text network calculated for betweenness centrality (smaller nodes overall) and one calculated for degree centrality (larger nodes). It’s normal to discover that most nodes in a network possess higher degree than betweenness centrality. However, in the context of human language, what precisely is signified by this variation? And is it significant?

Another way of posing the question is to ask what exactly one discovers about a string of words by applying centrality measurements to each word as though it were a node in a network, with edges between words to the right or left of it. The networks in the .gif visualize variation between two centrality measurements, but there are dozens of others that might have been employed. Which centrality measurements—if any—are best suited for textual analysis? When centrality measurements require the setting of parameters, what should those parameters be, and are they dependent on text size? And ultimately, what literary or rhetorical concept is “centrality” a proxy for? The mathematical core of a centrality measurement is a distance matrix, so what do we learn about a text when calculating word proximity (and frequency of proximity, if calculating edge weight)? Do we learn anything that would have any relevance to anyone since the New Critics?
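To pin down what "treating each word as a node" means in practice, here is a bare-bones sketch in plain Python. It is one possible construction, not the tool I used for the networks above: edges link adjacent words only, edge weight counts how often a pair is adjacent, and degree centrality is the share of other nodes a word touches. The function names and the sample sentence are made up.

```python
from collections import defaultdict


def build_text_network(tokens):
    """Link each word to its right-hand neighbor; weight = times adjacent."""
    weights = defaultdict(int)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # skip self-loops such as "very very"
            weights[frozenset((left, right))] += 1
    neighbors = defaultdict(set)
    for edge in weights:
        a, b = tuple(edge)
        neighbors[a].add(b)
        neighbors[b].add(a)
    return dict(neighbors), dict(weights)


def degree_centrality(neighbors):
    """Number of distinct neighbors divided by the number of other nodes."""
    n = len(neighbors)
    return {word: len(adjacent) / (n - 1) for word, adjacent in neighbors.items()}


tokens = "the state of the union is the state of the nation".split()
neighbors, weights = build_text_network(tokens)
centrality = degree_centrality(neighbors)

for word, score in sorted(centrality.items(), key=lambda item: -item[1]):
    print(word, round(score, 2))
```

Even in this toy sentence the parameter questions show up immediately: a wider window than one word, or a different treatment of repeated pairs, would produce a different network from the same string.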

It is not my goal (yet) to answer these questions but merely to point out that they need answers. DH researchers using networks need to come to terms with the linear algebra that ultimately generates them. Although a positive correlation should theoretically exist between different centrality measurements, differences do remain, and knowing which measurement to utilize in which case should be a matter of critical debate. For those using text networks, a robust defense of network application in general is needed. What is gained by thinking about text as a word network?

In an ideal case, of course, the language of social network theory transfers remarkably well to the language of rhetoric and semantics. Here is Linton C. Freeman discussing the notion of centrality in its most basic form:

Although it has never been explicitly stated, one general intuitive theme seems to have run through all the earlier thinking about point centrality in social networks: the point at the center of a star or the hub of a wheel, like that shown in Figure 2, is the most central possible position. A person located in the center of a star is universally assumed to be structurally more central than any other person in any other position in any other network of similar size. On the face of it, this intuition seems to be natural enough. The center of a star does appear to be in some sort of special position with respect to the overall structure. The problem is, however, to determine the way or ways in which such a position is structurally unique.

Previous attempts to grapple with this problem have come up with three distinct structural properties that are uniquely possessed by the center of a star. That position has the maximum possible degree; it falls on the geodesics between the largest possible number of other points and, since it is located at the minimum distance from all other points, it is maximally close to them. Since these are all structural properties of the center of a star, they compete as the defining property of centrality. All measures have been based more or less directly on one or another of them . . .

Addressing the notions of degree and betweenness centrality, Freeman says the following:

With respect to communication, a point with relatively high degree is somehow “in the thick of things”. We can speculate, therefore, that writers who have defined point centrality in terms of degree are responding to the visibility or the potential for activity in communication of such points.

As the process of communication goes on in a social network, a person who is in a position that permits direct contact with many others should begin to see himself and be seen by those others as a major channel of information. In some sense he is a focal point of communication, at least with respect to the others with whom he is in contact, and he is likely to develop a sense of being in the mainstream of information flow in the network.

At the opposite extreme is a point of low degree. The occupant of such a position is likely to come to see himself and to be seen by others as peripheral. His position isolates him from direct involvement with most of the others in the network and cuts him off from active participation in the ongoing communication process.

The “potential” for a node’s “activity in communication” . . . A “position that permits direct contact” between nodes . . . A “major channel of information” or “focal point of communication” that is “in the mainstream of information flow.” If the nodes we are talking about are words in a text, then it is straightforward (I think) to re-orient our mental model and think in terms of semantic construction rather than interpersonal communication. In other posts, I have attempted to adapt degree and betweenness centrality to a discussion of language by writing that, in a textual network, a word with high degree centrality is essentially a productive creator of bigrams but not a pathway of meaning. A word with high betweenness centrality, on the other hand, is a pathway of meaning: it is a word whose significations potentially slip as it is used first in this and next in that context in a text.

Degree and betweenness centrality—in this ideal formation—are therefore equally interesting measurements of centrality in a text network. Each points you toward interesting aspects of a text’s word usage.

However, most text networks are much messier than the preceding description would lead you to believe. Freeman, again, on the reality of calculating something as seemingly basic as betweenness centrality:

Determining betweenness is simple and straightforward when only one geodesic connects each pair of points, as in the example above. There, the central point can more or less completely control communication between pairs of others. But when there are several geodesics connecting a pair of points, the situation becomes more complicated. A point that falls on some but not all of the geodesics connecting a pair of others has a more limited potential for control.

In the graph of Figure 4, there are two geodesics linking p1 with p3, one via p2 and one via p4. Thus, neither p2 nor p4 is strictly between p1 and p3 and neither can control their communication. Both, however, have some potential for control.

CentralityBlogPost

Calculating betweenness centrality in this (still simple) case requires recourse to probabilities. A probabilistic centrality measure is not necessarily less valuable; however, the concept should give you an idea of the complexities involved in something as ostensibly straightforward as determining which nodes in a network are most “central.” Put into the context of a text network, a lot of intellectual muscle would need to be exerted to convert such a probability measurement into the language of rhetoric and literature (then again, as I write that . . .).
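Here is a small sketch of that partial, probability-like credit, using Brandes's algorithm for unweighted, undirected graphs in plain Python. The four-point ring is my reconstruction from Freeman's description (p1 reaches p3 via p2 or via p4) and may not match his Figure 4 edge for edge; the star is the hub-and-spokes case from his earlier passage. Scores are raw pair counts, not normalized.

```python
from collections import deque


def betweenness(graph):
    """Betweenness centrality, splitting credit across tied shortest paths."""
    scores = {node: 0.0 for node in graph}
    for source in graph:
        order = []                                # nodes in order of discovery
        predecessors = {node: [] for node in graph}
        paths = {node: 0 for node in graph}       # shortest-path counts
        paths[source] = 1
        distance = {node: -1 for node in graph}
        distance[source] = 0
        queue = deque([source])
        while queue:                              # breadth-first search
            node = queue.popleft()
            order.append(node)
            for neighbor in graph[node]:
                if distance[neighbor] < 0:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[node] + 1:
                    paths[neighbor] += paths[node]
                    predecessors[neighbor].append(node)
        dependency = {node: 0.0 for node in graph}
        while order:                              # walk back from the farthest nodes
            node = order.pop()
            for parent in predecessors[node]:
                share = paths[parent] / paths[node]
                dependency[parent] += share * (1 + dependency[node])
            if node != source:
                scores[node] += dependency[node]
    # Each unordered pair was visited from both ends, so halve the totals.
    return {node: score / 2 for node, score in scores.items()}


# Assumed four-point ring: two geodesics connect p1 and p3.
ring = {"p1": ["p2", "p4"], "p2": ["p1", "p3"], "p3": ["p2", "p4"], "p4": ["p3", "p1"]}
# Star: every geodesic between two leaves runs through the center.
star = {"center": ["a", "b", "c"], "a": ["center"], "b": ["center"], "c": ["center"]}

ring_scores = betweenness(ring)
star_scores = betweenness(star)
print(ring_scores)
print(star_scores)
```

In the ring, p2 sits on one of the two geodesics between p1 and p3, so it earns half a point for that pair and nothing for any other; the center of the star earns a full point for each of the three leaf pairs.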

As I said, there is reading to be done, mathematical concepts to comprehend, and debates to be had. And ultimately, what we are after perhaps isn’t centrality measurements at all but metrics for node (word) influence. For example, if we assume (as I think we can) that betweenness centrality is a better metric of node influence than degree centrality, then the .gif above clearly demonstrates that degree centrality may be a relatively worthless metric: it gives you a skewed sense of which words exert the most control over a text. What’s more, node influence is a concept sensitive to scale. Though centrality measurements may inform us about influential nodes across a whole network, they may underestimate the local or temporal influence of less central nodes. Centrality likely correlates with node influence but I doubt it is determinative in all cases. Accessing text (from both a writer’s and a reader’s perspective) is ultimately a word-by-word or phrase-by-phrase phenomenon, so a robust text network analysis needs to consider local influence. A meeting of network analysis and reader response theory may be in order. Perhaps we are even wrong to expunge function words from network analysis. As Franco Moretti has demonstrated, analysis of words as seemingly disposable as ‘of’ and ‘the’ can lead to surprising conclusions. We leave these words out of text networks simply because they create messy, spaghetti-monster visualizations. The underlying math, however, will likely be more informative, once we learn how to read it.

Hindi 101

I’m taking Hindi 101 this semester. The Devanagari script feels mildly ornate in my hand compared to the angularity of alphabets descended from the Phoenician script (including the English alphabet), but it is quite lovely and not as challenging as I had imagined. It is still an alphabet, after all, with a much closer sound-grapheme correspondence than one finds in English, where each letter, particularly the vowels, can correspond to multiple phonemes. (English grammar is absurdly simple compared to all other major languages, but our spelling system must be a nightmare for foreign learners. There’s something to be said for language academies that control the drift between pronunciation and spelling.) Devanagari does, however, omit some vowel sounds and uses secondary or “dependent” vowel forms in most contexts, so it has something of the syllabary about it. In fact, the biggest mistake I make in class is to confuse two dependent vowels, ी and ो. The former is long “ee”, the latter is “o”, but in certain fonts (including my own handwriting), they look nearly identical.

The script’s biconsonantal conjuncts are mostly intuitive, though a few bizarre ones need to be memorized as separate graphemes. We have conjuncts in English, but I believe they are a relatively new innovation with limited usage. One example is the city logo of Huntington Beach, California. Hindi has a lot of these, and they are quite common.

[Figure: An English biconsonantal conjunct.]

Apart from learning a new script, the most enjoyable part of Hindi class has been coming across Romance or Germanic cognates. At an intellectual level, I know and have long known that Hindi and English, both Indo-European languages, share a genetic ancestry, which means that at some point in the distant past all Indo-European speakers spoke the same language. It’s easy to get a handle on the concept when talking about Romance languages: Spanish, Italian, and French all used to be Latin. There, we have a well documented history, stretching back through the Renaissance and middle ages to the familiar world of Rome. However, when it comes to Proto-Indo-European, we are faced with a deeper and wider canyon of time and an ancient world that is mostly unknown to us. The PIE speakers were probably living in the Pontic-Caspian steppe lands, but some evidence suggests that they may have been living in the greater Anatolian region; perhaps the most direct descendants of Proto-Indo-Europeans are today’s Armenians, Turks, and Persians. They apparently kicked ass and took names because Indo-European now stretches from the Pacific to the Indian Oceans.

But whoever they were, the PIE speakers are remote in a way that the Romans or Germanic tribes are not. Yet while doing my Hindi homework, every now and again I come across a word that clearly indicates the ancient linguistic (and genetic) connectedness between the Romans, the Germans, and the Hindi speakers. Kamiz for shirt; mez for table; kamra for room; mata for mother; pita for father; nam for name; darvaza for door . . . In Hindi class, when I say a word out loud that is clearly related to a European word, I am intoning sounds close to the ones that came from the lips of those ancient Indo-Europeans before they split eastward and westward to conquer Eurasia. To language nerds like me, it’s a chilling sensation.

Elliot Rodger’s Manifesto: Text Networks and Corpus Features

Analyzing manifestos is becoming a theme at this blog. Click here for Chris Dorner’s manifesto and here for the Unabomber manifesto.

Manifestos are interesting because they are the most deliberately written and deliberately personal of genres. It’s tenuous to make claims about a person’s psyche based on the linguistic features of his personal emails; it’s far less tenuous to make claims about a person’s psyche based on the linguistic features of his manifesto, especially one written right before he goes on a killing rampage. This one (“My Twisted World,” written by omega male Elliot Rodger) is 140 pages long, and is part manifesto, part autobiography.

I’ve made a lot of text networks over the years—of manifestos, of novels, of poems. Never before have I seen such a long text exhibit this kind of stark, binary division:

RodgersBetweennessCentrality

This network visualizes the nodes with the highest betweenness centrality. The lower, light blue cluster is Elliot’s domestic language; this is where you’ll find words like “friends,” “school,” “house,” et cetera . . . words describing his life in general. The higher, red cluster is Elliot’s sexually frustrated language; this is where you’ll find words like “girls,” “women,” “sex,” “experience,” “beautiful,” “never” . . . words describing his relationships (or lack thereof) with the feminine half of our species.

It’s quite startling. Although this text is part manifesto and part autobiography, I wasn’t expecting such a clear division: the language Elliot uses to describe his sexually frustrated life is almost wholly severed from the language he uses to describe his life apart from the sex and the “girls” (Elliot uses “girls” far more frequently than he uses “women”—see below). It’s as though Elliot had completely compartmentalized his sexual frustration, and was keeping it at bay. Or trying to. I don’t know how this plays out in individual sections of the manifesto. Nor do I know what it says about Elliot’s mental health more generally. I’ve always believed that compartmentalizing frustrations is, contra popular advice, a rather healthy thing to do. I expected a very, very tortuous and conflicted network to emerge here, indicating that each aspect of Elliot’s life was dripping with sexual angst and misogyny. Not so, it turns out.
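For readers unfamiliar with betweenness centrality, here is a minimal sketch of the idea in Python. It is not the AutoMap workflow used for the networks above: the seven-word token list, the adjacency window, and the function names are all invented for illustration. It links neighboring words into a graph and then scores each word by how many shortest paths between other words pass through it (Brandes’ algorithm for an unweighted, undirected graph).

```python
from collections import defaultdict, deque

def cooccurrence_graph(tokens, window=1):
    """Link each word to the distinct words that follow it within `window` tokens."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph

def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted graph)."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        stack = []                              # nodes in order of discovery
        preds = {v: [] for v in graph}          # predecessors on shortest paths
        paths = dict.fromkeys(graph, 0)         # number of shortest paths from source
        paths[source] = 1
        dist = dict.fromkeys(graph, -1)
        dist[source] = 0
        queue = deque([source])
        while queue:                            # breadth-first search
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        while stack:                            # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected path was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}

# Toy input: two small "clusters" of words joined only through "i".
tokens = ["friends", "school", "house", "i", "girls", "women", "sex"]
scores = betweenness(cooccurrence_graph(tokens, window=1))
most_central = max(scores, key=scores.get)
```

In this toy chain, the bridging word gets the highest score because every path between the two halves has to run through it, which is the same reason bridging nodes stand out in a betweenness-based visualization.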

Here’s a brief “zoom” on each section:

RodgersDegreeCentralityDomestic

RodgersDegreeCentralityWomen

In the large, zoomed-out network—the first one in the post—notice that the most central nodes are “me” and “my.” I processed the text using AutoMap but decided to retain the pronouns, curious how the feminine, masculine, and personal pronouns would play out in the networks and the dispersion plots. Feminine, masculine, personal—not just pronouns in this particular text. And what emerges when the pronouns are retained is an obvious image of the Personal. Rodger’s manifesto is brimming with self-reference:

RodgersPronouns

Take that with a grain of salt, of course. In making claims about any text with these methods, one should compare features with the features of general text corpora and with texts of a similar type. The Brown Corpus provides some perspective: “It” is the most frequent pronoun in that corpus; “I” is second; “me” is far down the list, past the third-person pronouns.

Here’s another narcissistic twist, found in the most frequent words in the text. Again, pronouns have been retained.

RodgersFreqWords

“I” is the most frequent word in the entire text, coming before even the basic functional workhorses of the English language. The Brown Corpus once more provides perspective: “I” is the 11th most frequent word in that general corpus. Of course, as noted, there is an auto-biographic ethos to this manifesto, so it would be worth checking whether or not other auto-biographies bump “I” to the number one spot. Perhaps. But I would be surprised if “I,” “me,” and “my” all clustered in the top 10 in a typical auto-biography—a narcissistic genre by design, yet I imagine that self-aware authors attempt to balance the “I” with a pro-social dose of “thou.” Maybe I’m wrong. It would be worth checking.
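The check proposed above (does a given autobiography push “I,” “me,” and “my” into the top ranks?) is easy to run on any plain-text file. The following is a minimal sketch in Python, not the procedure used for the chart above; the tokenizer is a crude regular expression and the sample sentence is made up for illustration.

```python
import re
from collections import Counter

def word_ranks(text):
    """Return {word: rank}, where rank 1 is the most frequent word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    ordered = [word for word, _ in Counter(tokens).most_common()]
    return {word: position + 1 for position, word in enumerate(ordered)}

# Invented sample; swap in the full text of a manifesto or an autobiography.
sample = "I went home and I told my mother that I was tired of my life and of me"
ranks = word_ranks(sample)

# Ranks of the first-person forms discussed above (lowercased by the tokenizer).
first_person = {word: ranks[word] for word in ("i", "me", "my") if word in ranks}
```

Running the same function over a reference corpus and over a few autobiographies would give the comparison the paragraph above asks for.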

More lexical dispersion plots . . .

Much more negation is seen below than is typically found in texts. According to Michael Halliday, most text corpora will exhibit 10% negative polarity and 90% positive polarity. Elliot’s manifesto, however, bursts with negation. Also notice, below, the constant references to “mother” and “father”—his parents are central characters. But not “mom” and “dad.” I’m from Southern California, born and raised, with social experience across the races and classes, but I’ve never heard a single English-only speaker refer to parents as “mother” and “father” instead of “mom” and “dad.” Was Elliot bilingual? Finally, note that Elliot prefers “girl/s” to “woman/en.”

RodgersGirlsGuys

RodgersMotherFather

RodgersNegation

RodgersSexEtc
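The data behind a lexical dispersion plot like the ones above is just the list of token offsets at which each target word occurs. Here is a minimal Python sketch of that bookkeeping, plus a rough negation count. The negator list, the tokenizer, and the sample sentences are assumptions made for illustration, and a token-level rate is not the same thing as Halliday’s clause-level polarity figure, so the two numbers should not be compared directly.

```python
import re

# Assumed, deliberately short list of negators; extend as needed.
NEGATORS = {"no", "not", "never", "nothing", "nobody", "none"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def dispersion(tokens, targets):
    """Map each target word to the token offsets at which it occurs."""
    hits = {target: [] for target in targets}
    for offset, token in enumerate(tokens):
        if token in hits:
            hits[token].append(offset)
    return hits

def negation_rate(tokens):
    """Share of tokens that are negators, counting n't contractions."""
    negative = sum(1 for t in tokens if t in NEGATORS or t.endswith("n't"))
    return negative / len(tokens)

# Invented sample text.
sample = "My mother never called. My father did not call. I never had a girl."
tokens = tokenize(sample)
offsets = dispersion(tokens, ["mother", "father", "never"])
rate = negation_rate(tokens)
```

Plotting each offset list as tick marks along a horizontal axis reproduces the familiar dispersion picture.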

Until I discover that auto-biographical texts always drip with personal pronouns, I would argue that Elliot’s manifesto is the product of an especially narcissistic personality. The boy couldn’t go two sentences without referencing himself in some way.

And what about the misogyny? He uses masculine pronouns as often as he uses feminine pronouns; he refers to his father as often as he refers to his mother—although, it is true, the references to mother become more frequent, relative to father, as Elliot pushes toward his misogynistic climax. Overall, however, the rhetorical energy in the text is not expended on females in particular. This is not an anti-woman screed from beginning to end. Also, recall, the preferred term is “girls,” not “women.” Elliot hated girls. Women—middle-aged, old, married, ensconced in careers, not apt to wear bikinis on the Santa Barbara beach—are hardly on Elliot’s radar. (This ageism also comes through in his YouTube videos.) Despite the “I hate all women” rhetorical flourishes at the very beginning and the very end of his manifesto, Elliot prefers to write about girls—young, blonde, unmarried, pre-career, in sororities, apt to wear bikinis on the Santa Barbara beach.

I noticed something similar in the Unabomber manifesto. Not about the girls. About the beginning and ending: what we remember most from that manifesto is its anti-PC bookends, even though the bulk of the manifesto devotes itself to very different subject matter. The quotes pulled from manifestos (including this one) and published by news outlets are a few subjective anecdotes, not the totality of the text.

Anyway. Pieces of writing that sally forth from such diseased individuals always call to mind what Kenneth Burke said about Mein Kampf:

[Hitler] was helpful enough to put his cards face up on the table, that we might examine his hands. Let us, then, for God’s sake, examine them.

 

Demographic distribution: Gender of citations in CCC, RSQ, and RR abstracts

This post follows up on my discussion of citation frequencies in abstracts in rhetoric and composition journals. To reiterate, a safe assumption to make is that citations in abstracts are “central” to the arguments presented and the research undertaken in the articles themselves; they are particularly informative about overall trends. The genre of the humanities article demands more citations than a core argument actually requires, so looking at citations in abstracts should control for that genre requirement, distilling down all citations to the most vital ones.

The journals: College Composition and Communication (CCC), Rhetoric Society Quarterly (RSQ), and Rhetoric Review (RR). The CCC abstracts run from February 2000 (51.3) to September 2011 (63.1), a total of 261 abstracts. The RSQ abstracts run from Winter 2000 (30.1) to Fall 2011 (41.5), a total of 220 abstracts. The RR abstracts run from 2002 (21.3) to 2011 (30.4), a total of 154 abstracts.

The previous post discussed the “long tail” distribution that emerged from the citation frequencies and what it means for disciplinary identity. This post presents information on the gender of the sources cited in the abstracts, then makes a few comments about demographic distributions in general.

There are 79 unique citations in the CCC abstracts; 159 unique citations in the RSQ abstracts; and 121 unique citations in the RR abstracts. (See previous post for .xls data files.) Here’s how the gender distribution falls: in CCC, 23 out of the 79 sources are female; in RSQ, 39 out of the 159 sources are female; in RR, 36 out of the 121 sources are female.
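For anyone who wants to check the percentages behind these counts, the arithmetic is a one-liner per journal. A small Python sketch using only the figures given above (the variable and function names are my own):

```python
# (female sources, total unique citations) per journal, as reported above.
counts = {"CCC": (23, 79), "RSQ": (39, 159), "RR": (36, 121)}

def female_share(female, total):
    """Percentage of cited sources that are female, to one decimal place."""
    return round(100 * female / total, 1)

shares = {journal: female_share(f, t) for journal, (f, t) in counts.items()}
# shares == {"CCC": 29.1, "RSQ": 24.5, "RR": 29.8}
```

So roughly 29% of the cited sources are female in CCC, 25% in RSQ, and 30% in RR.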

And here are graphs of the raw counts and of the percentages:

Abstract citations by gender (raw count)


Abstract citations by gender (percentage)


In Authoring a Discipline, Maureen Daly Goggin has shown that by 1990 total contributors to 9 of rhetoric and composition’s major journals—including the 3 analyzed here—had equalized to a nearly 50/50 split between males and females. I imagine this trend has continued into the new millennium, but it would be worthwhile to determine whether or not that’s the case.

What has not equalized, however, is the gender contribution in terms of citations. Odds are, counting all citations in the articles themselves would alleviate the large gap seen in the graphs above. But insofar as we accept that abstract citations represent the most vital sources in each journal, then an obvious gender gap still exists in CCC, RSQ, and RR citations.

In RSQ and RR, this gap, in part, likely has something to do with these journals’ tendencies to publish work on rhetorical history. I pointed this out in the last post: 27 (or 22%) of the RR citations are sources from the 17th century or earlier. 26 (or 16%) of RSQ citations are from the same period. Those numbers would grow if they included figures from the 18th and 19th centuries, as well. The reality is, most of these historical sources are male: Plato, Cicero, Aristotle, Quintilian, et cetera.

I have no ready explanation for why CCC citations should have as large a gender gap as the other journals’ citations, given that CCC builds most of its scholarship on sources from the middle part of the 20th century or later. If we look at the 102 most cited figures in CCC between 1987 and 2011 (Mueller, “Grasping”), we discover that 43/102 (42%) of the sources are female: a gender imbalance, but one not nearly as pronounced as the imbalance that surfaces in abstract citations. I’d be curious to see the gender distribution in Mueller’s entire data set. Is there a nearly 50/50 split between male and female sources across all citations in CCC between 1987 and 2011? If so, we could model the gender imbalance in this journal’s citations as an emergent feature: 50/50 across the entire data set; 58/42 in the most popular citations between 1987 and 2011; 71/29 in abstracts between 2000 and 2011. It’s unfortunate that CCC did not publish abstracts until the late 1990s, so that the dates of the abstracts and the articles could be uniform.

The question of demographic balance is one that spills a lot of digital ink. Just this morning, Scott Weingart visualized the gender (im)balance of Digital Humanities Conference attendees: about a 70/30 split that favors males. And Google recently released the demographic characteristics of its workforce: 30% of its employees are women; 17% of its technical employees are women. 60% of its employees are white; 30% of its employees are Asian (read: East Asian and Indian); and only 3% of its employees are Non-Asian Minorities.

I asked Scott why our default assumption should be uniform demographic distribution. When looking at statistical trends that emerge at large scales, we shouldn’t be surprised to discover that human populations cluster differently. At least, that’s my default assumption. The DH Conference draws more males, but then, an Early Childhood Education conference will draw more females. (I once attended a conference on speech and behavior therapy for autistic children; there were no more than three or four males amid about seventy females.) Or take a look at the National Association for the Education of Young Children. Although we often hear about the male-ness of executive boards, the NAEYC’s executive team is entirely female, and its 17-member governing board boasts 13 females and only 4 males. Looking at all the Early Childhood Education associations and organizations in the country, what gender trends would we expect to find?

The first question to ask about demographic distribution in any particular population (like Google’s workforce or citations in abstracts) is this: What are the characteristics of the larger population from which this particular population is drawing? As long as rhetorical scholars continue to look at rhetorical history, where most of the figures are male, then we can continue to expect many citations in these historical journals to be male. (This may change, however, as more and more rhetorical historians re-discover the history of female oratory.) Or, in Google’s case, if we take the American population as the baseline, assuming a 50/50 gender split, then clearly there is a gender imbalance. But in terms of race and ethnicity, its white workforce is in fact under-represented. Raising the percentage of blacks and Hispanics at Google would mean firing a lot of the Chinese and Indians, unless we want to make whites more under-represented than they already are. (A fairer baseline population would be the percentage of working-age adults in America, or, better yet, the percentage of working-age adults with college degrees; however, those stats are much harder to come by. Total population is a decent but imperfect proxy.)

The point is that we do not always find particular populations boasting a uniform or near-uniform demographic distribution. Why is this? A complex question. Given the totality of the human population (or, more humbly, the total population of a given geographic area), why do we find the smaller population clusters clustering the way they do around different practices? Why are there more males in CCC citations? Why are there more males at the DH Conference? Why are East Asians and Indians so over-represented at Google? Why are there so few East Asians and Indians in the NFL and the NBA? That populations cluster differently around different practices seems to be a statistical fact. Is it also a future inevitability?

A possible explanation for the emergence of quotative “like” in American English

So Monica was like, “What are you doing here, Chandler?” and Chandler was like, “Uhh nothing” and then Monica was like, “Why are you here with Phoebe?” and Chandler was like, “I don’t know,” and Monica was like, “Whatever!”

Quotative “be like” probably gets on your nerves. Unfortunately for you, it spread like wildfire in the latter half of the 20th century and today is used by native and non-native speakers alike as often as they use traditional say-type quotatives. What is its structure, when did it arise, and why did it spread so quickly? This post offers a possible explanation, based on evidence dragged up from the depths of the Google Books Corpus. To appreciate that evidence, however, we need to start with some discussion of this quotative’s formal properties.

1

One interesting property of quotative “be like” is its ambiguous semantics. In some contexts, it is a stative predicate that denotes internal speech, i.e., thoughts reflexive of an attitude. In other contexts, it is an eventive predicate denoting an actual speech act. Sometimes, the denotation is ambiguous, as in (1):

(1) Monica was like, “Oh my God!”

. . . Did Monica literally say “Oh my God!” or did she just think or feel it?

Another interesting property of quotative “be like” is that it disallows indirect speech.

(2a) Monica was like, “I should go to the mall.”

(2b) *Monica was like that she should go to the mall.

(2c) *Monica was like she should go to the mall.

Quotative say of course allows indirect speech:

(3a) Monica said, “I should go to the mall.”

(3b) Monica said that she should go to the mall.

(3c) Monica said she should go to the mall.

Haddican et al. (2012) recognize that quotative “be like” is immune to indirect speech due to its mimetic implicature. (2b) cannot be allowed because quotative “be like” always means something more along these lines:

(4) Monica was like: QUOTE

Given the implied mimesis of this construction, it makes no sense, as in (2b) and (2c), to add an overt complementizer and to change person/tense to produce an indirect, third person report. This property is shared by all uses of quotative “be like,” whether in their stative or eventive readings.

But there’s more to it than a mimetic implicature. Schourup (1982) points out that quotative “go” also shares this mimetic property (although he does not frame it as such). As expected of a quotative with a mimetic implicature, quotative “go” likewise does not allow an indirect speech interpretation via addition of an overt complementizer and shifts in person/tense:

(5a) Monica goes, “I should go to the mall.”

(5b) *Monica goes that she should go to the mall.

Why should these innovative quotatives be so immune to indirect speech and so committed to direct quote marking? Schourup suggests that quotative “go” (and, by extension, quotative “be like”) arose precisely to meet English’s need for a mimetic, unambiguous direct quotation marker. Prior to the occurrence of these new quotatives, English lacked such a marker. Consider (6a) and (6b) below:

(6a) When I talked to him yesterday, Chandler said that you should go to the doctor.

(6b) When I talked to him yesterday, Chandler said you should go to the doctor.

There is no ambiguity in (6a). The speaker of this utterance clearly intends to convey to his interlocutor that Chandler said the interlocutor should go to the doctor. (6b), however, introduces ambiguity. The utterance in (6b) can be interpreted in two ways: a) Chandler said the speaker of the utterance (i.e., I) should go to the doctor; b) Chandler said the speaker’s interlocutor (i.e., you) should go to the doctor. With orthographic conventions, of course, this ambiguity disappears:

(6c) When I talked to him yesterday, Chandler said, “You should go to the doctor.” (So I went.)

However, unlike other languages, spoken English has no “quoting” conventions—it has no direct quote markers for unmarked speech. It is unclear if (6b) is a true quotative or merely an indirect report on speech with a null complementizer.

QuotvsInt

We can imagine speakers needing to clarify this ambiguity:

JOEY: When I talked to him yesterday, Chandler said you should go to the doctor.

ROSS: Wait, he said I should go or you should go?

This ambiguity arises with say-type verbs whenever the complementizer that is omitted. It is traditionally understood that English differentiates between direct quotatives and indirectly reported speech via shifts in person and/or tense. However, the overt complementizer is really the central feature of this differentiation. Without an overt complementizer, it is never entirely clear if the embedded clause is a direct quote or an indirect report of speech. Here’s another example:

(7) JOEY: Chandler said I will be responsible for the cat’s funeral.

Without the aid of quote marks, we cannot know whether Chandler or Joey is responsible for the cat’s funeral, even though the embedded clause contains a shift in both person and tense. Of course, if Joey wants to convey that Joey himself will be responsible for the cat’s funeral, he can simply add the overt complementizer: “Chandler said that I will be responsible . . .” However, if Joey wants to convey that Chandler has decided to be responsible, Joey has no way to convey it unambiguously with say-type verbs. He must resort to an indirect speech construction with an overt complementizer. Alternatively, he can resort to non-structural signals: a short pause, a change in intonation, or a mimicry of Chandler’s voice. Or he must abandon say-type constructions altogether and convey his meaning some other way.

Quotative “go” and quotative “be like” solve this ambiguity. These innovative quotatives always signal that the following clause is mimetic, a direct quote of speech or thought. Many languages—Russian, Japanese, Georgian, Ancient Greek, to name just a few—have overt markers to ensure that interior clauses are understood as being directly quoted material, whether or not those quoted clauses contain grammatical shifts (though of course they often do). The quotatives “go” and “be like” serve this same purpose. They are structural, unambiguous markers for direct speech, which is why one cannot use them for indirect speech, and which is also why they have spread so widely and quickly: they have met a real need in the language.

Quotative “go,” however, is attested long before quotative “be like.” The Oxford English Dictionary puts the earliest usage in the early 19th century, initially as a way to mime sounds people made, then later as a way to report on actual speech. Here’s an example from Dickens’ Pickwick Papers:

DickensPickwick

So, although I have said that both quotative “be like” and quotative “go” met a need in English for an unambiguous direct quotation marker, it was “go” that in fact met the need first, by at least a century. This historical fact leads me to suspect that quotative “be like” met a slightly different need: while quotative “go” became a direct quotation marker for speech acts, quotative “be like” became a direct quotation marker for thoughts. As Haddican et al. rightly note, an innovative feature of these quotatives is that they allow direct quotes to be descriptors of states. In other words, the directly marked quotes of “go” denote external speech; the directly marked quotes of “be like” primarily denote internal speech, i.e., thoughts or attitudes. I believe this hypothesis is supported by the earliest uses of quotative “be like,” to which we now turn:

2

Today, young native and non-native speakers of English frequently use “like” as a versatile discourse marker or interjection in addition to its use as a quotative (D’Arcy 2005). D’Arcy provides two extreme examples of discourse marker “like.” Both are taken from a large corpus of spoken English:

(8) I love Carrie. LIKE, Carrie’s LIKE a little LIKE out-of-it but LIKE she’s the funniest, LIKE she’s a space-cadet. Anyways, so she’s LIKE taking shots, she’s LIKE talking away to me, and she’s LIKE, “What’s wrong with you?”

(9) Well you just cut out LIKE a girl figure and a boy figure and then you’d cut out LIKE a dress or a skirt or a coat, and LIKE you’d colour it.

This usage does not become noticeable in available corpora until the 1980s, so nearly all papers that I have read assume that discourse marker “like” and quotative “be like” arose more or less in tandem during the 1970s, becoming common by the 1980s. However, using the Google Books Corpus, I was able to find an early use of “like” that presages quotative “be like.” This early use also seems to set the stage for the versatile discursive uses of “like” seen in (8) and (9). This early use is the expression, “like wow.” It seems to have arisen during the 1950s (though perhaps earlier) in the early rock ’n’ roll scenes in the Southern United States. Here are some examples.

The first is from 1957: a line from a rock ’n’ roll song by Tommy Sands:

(10) When you walk down the street, my heart skips a beat—man, like wow!

The second is from a 1960 issue of Business Education World:

(11) Like, wow! I’m taking a real cool course called general business. It’s the most.

BusinessEducationWorld

The third is from a novel called The Fugitive Pigeon, published in 1965:

(12) But all of a sudden you’re like wow, you know what I mean?

And by 1971, we have a full example of quotative “be like.” Note that this early occurrence uses an expletive as the subject:

(13) But to me it was like, “Oh, why can’t you say, ‘Gee that’s wonderful . . .’”

LifeMagazine1971

These early uses of “like wow” in (10) and (11) denote a stative feeling or attitude rather than any kind of eventive speech act. This is especially clear in (11), where the expression is a direct response to a question about how the speaker is feeling. The quotative in (13) likewise seems to be a stative predicate rather than an eventive one. In fact, in nearly all of the earliest uses of quotative “be like”—from the 1970s and early 1980s, as reported in the Google Books Corpus—the intention is to denote a feeling or attitude, not a direct quote of a speech act. Such eventive predications don’t become common until the 1990s and 2000s.
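The kind of search that turns up these attestations can be approximated with two regular expressions run over dated snippets. The sketch below is only an illustration in Python, not the actual Google Books query: the patterns and names are my own assumptions, and the four records are short excerpts of examples (10) through (13) above.

```python
import re

# (year, excerpt) pairs drawn from examples (10)-(13) above.
attestations = [
    (1957, "man, like wow!"),
    (1960, "Like, wow! I'm taking a real cool course called general business."),
    (1965, "But all of a sudden you're like wow, you know what I mean?"),
    (1971, 'But to me it was like, "Oh, why can\'t you say'),
]

PATTERNS = {
    # "like wow", with or without a comma after "like".
    "like_wow": re.compile(r"\blike,?\s+wow\b", re.IGNORECASE),
    # A form of "be" (full or contracted), then "like", then an opening quote mark.
    "be_like_quote": re.compile(
        r"(?:\b(?:am|is|are|was|were)|'(?:m|s|re))\s+like,?\s+" + '["\u201c]',
        re.IGNORECASE,
    ),
}

def first_attested(records):
    """Return the earliest year at which each pattern matches a dated record."""
    earliest = {}
    for year, text in sorted(records):
        for name, pattern in PATTERNS.items():
            if name not in earliest and pattern.search(text):
                earliest[name] = year
    return earliest

timeline = first_attested(attestations)
# timeline == {"like_wow": 1957, "be_like_quote": 1971}
```

Note that (12), “you’re like wow,” matches only the first pattern because no quotation mark follows “like,” which fits its status as a bridge between the two constructions.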

“Like wow,” then, arose in 1950s slang as a stative description. However, the sentence in (14) below suggests that wow was not interpreted as a structurally independent interjection but as an adjective. This is from a 1960 edition of Road and Track magazine:

(14) Man, that crate would look like wow with a Merc grille.

RoadTrack

It is possible that like is an adverb here, but in my estimation it is most likely still a garden variety manner preposition that has innovatively selected for a bare adjective. Typically, like as a preposition only selects NPs as its complement. However, with the advent of “like wow,” it loosened its selection requirements and began to select for adjectives as well. And not just adjectives. The bottom line in this advertisement from Billboard magazine in May 1960 demonstrates that it also began to select for adverbs:

BillboardLikeWowAd

Apparently, in the 1950s and early 1960s, like became a popular and versatile manner preposition. Once like loosened its requirements to select AP complements, it’s easy to see how it could start selecting quotes, thus becoming a new direct quote marker (like narrative “go”); and given the stative denotation of the original phrase “like wow,” it’s also easy to see why stative to be would become the verbal element in this quotative rather than a lexical verb like act or go. Indeed, it appears that the first uses of quotative “be like” were entirely restricted to the phrase “like wow,” ensuring that subsequent uses would likewise have stative readings. (The ad above also shows how easy it would be for like to become an all-around discourse marker once it began to select for a wider range of phrases.)

So, based on the timeline of evidence in the corpus, I posit the following evolution:

LikeEvolution

The emergence of quotative “like”

I follow Haddican et al. in assuming that like in quotative “be like” is still a manner preposition. However, while they assume the preposition did not undergo any change, I argue that like became more versatile in its selection restrictions. This versatility allowed it first to select APs, then to select quotes. Initially, this quotative construction was just an extension of the phrase “like wow,” but it soon began to select any quoted material. And from the beginning, this quotative possessed two features: a) it had an obvious mimetic implicature, ensuring that it would be a direct quote marker, similar to narrative “go”; and b) it had a stative denotation, due to the stative denotation of the original phrase “like wow,” ensuring that the directly marked quotes were reflective of internal speech, i.e., thoughts or attitudes.

A corpus analysis by Buchstaller (2001) has shown that, even today, quotative “go” is much more likely than quotative “be like” to frame “real, occurring speech” (p. 10); in other words, “be like” continues to be used more often as a stative rather than eventive predicate. As I mentioned earlier, Haddican et al. are correct that one innovative aspect of quotative “be like” is that quotes are now able to be descriptors of states; however, I believe they overstate the eventive vs. stative ambiguity that arises in these quotatives. Most of the time, in real contexts, they are as unambiguously stative as they are unambiguously mimetic of the state. Haddican et al. themselves note that even these eventive readings are open to clarification. Asking whether or not someone “literally” said something sounds much odder following a say-type quotative than a “be like” quotative with a putatively eventive reading.

3

Nevertheless, as I showed at the very beginning of this post, there are instances where quotative “be like” seems to denote an eventive speech act. Linguistically, this is odder than it sounds at first. A single verbal construction—like quotative “be like”—should not have a stative and eventive reading. This ambiguity can only happen for two reasons: either there is some special semantic function at work in this construction, or there are in fact two separate quotative constructions, each with its own syntactic structures.

It is tempting to see a correlation between this ambiguity and the putative ambiguity between stative be and eventive be, also known as the be of activity. Consider the following sentences:

(15) Joey was silly.

(16) Rachel asked Joey to be silly.

Both forms of be select an adjective; however, (16), unlike (15), can be taken to mean that Joey performed some silly action. In other words, the small clause in (16) seems to be an eventive predication, not a stative one. It has been argued (Parsons 1990) that this eventive be is not the usual copular form but a completely different verb that means something like “to act”—in other words, English to be is actually a homophonous pair of verbs, similar to auxiliary have and possessive have. Perhaps this lexical ambiguity in be is related to the eventive vs. stative ambiguity in quotative “be like.” The stative reading arises when stative be is involved; the eventive reading arises when the eventive, lexical be is involved.

Haddican et al. argue against this line of thought. Diachronically, we know that quotative “be like” has arisen rapidly in many varieties of English, and that in all of these varieties, the semantics are ambiguous. But if there are in fact two be verbs that underwent this quotative innovation, then we would need to posit two unrelated channels of change: one in which like+QUOTE became a possible complement of stative be and one in which like+QUOTE became a possible complement of eventive be.

This is actually a problematic claim, given that, presumably, stative and eventive be have different structures. The former undergoes its typical V to T movement in English; the latter, given its eventive semantics, would be expected to remain in the VP like any other lexical verb. These underlying structures would demand that we devise different processes by which quotative “be like” arose. However, given the rapidity with which it did in fact arise, it is more probable that it arose via a single process—and the inevitable conclusion is that there is a single, stative verb to be that underwent the process. This conclusion is also verified by the auxiliary-like behavior of be in quotatives involving adverbs and questions:

(17) Ross was totally like, “I don’t care!”

(18) Was Ross like, “I don’t care”?

Although the ambiguous stative vs. eventive reading still occurs here, (17) exhibits raising above AdvP, and (18) exhibits subject-aux inversion. In other words, be in these quotatives behaves like an ordinary copular auxiliary, not a lexical verb. We therefore should not posit a separate, eventive be verb. We need another way to explain the semantic ambiguity of these quotatives.

Haddican et al. explain this ambiguity with Davidsonian semantics. Briefly stated, they argue that there is a single stative be verb—both in these quotative constructions and in English more generally. However, be has a semantic LOCALE function that, in certain contexts, can localize the state in a short-term event, and this localization of an event can force an agentive role onto the subject, even when an adjective has been selected by be. So, in a sentence such as (19), be will have a denotation as in (20):

(19) Joey is being silly.

(20) [[be]] = λS λe λx. ∃s ∈ S [e = LOCALE(s) & ARGUMENT(x, e)]

(20) takes a property of state S and localizes it into an event (a moment in which Joey was silly); in the right context, it is not a great leap to coerce this experiencer event into an agentive one. The application of these semantics to “(be) like” quotatives is straightforward:

In the state reading, be like is simply a stage level use of the copula, localised to the event in which the subject of be exhibited the relevant behaviour. The eventive reading arises when the event mapped to is an agentive one, where the most plausible event of an agent behaving in a quotative manner is the relevant speech act. (Haddican et al. 2012, p. 85)

In short, the ambiguity between stative and eventive “be like” arises from a semantic property that forces certain “states of being” to be processed as localized events whereby the experiencer of the event takes on an agentive role. In certain quotative contexts, the embedded quote is processed as an event, and the subject is understood as having caused that event, i.e., as actually saying something rather than just experiencing an attitude.
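
To make the denotation in (20) concrete, here is a rough sketch of how it might compose for (19). Treating the adjective silly as a set of states (written SILLY), writing j for Joey, and existentially closing the event variable at the end are my own simplifying assumptions, not steps spelled out by Haddican et al.

```latex
\begin{align*}
[\![\text{be}]\!] &= \lambda S\,\lambda e\,\lambda x.\ \exists s \in S\,[\,e = \mathrm{LOCALE}(s) \wedge \mathrm{ARGUMENT}(x,e)\,] \\
[\![\text{be}]\!](\mathrm{SILLY}) &= \lambda e\,\lambda x.\ \exists s \in \mathrm{SILLY}\,[\,e = \mathrm{LOCALE}(s) \wedge \mathrm{ARGUMENT}(x,e)\,] \\
[\![\text{(19)}]\!] &\approx \exists e\,\exists s \in \mathrm{SILLY}\,[\,e = \mathrm{LOCALE}(s) \wedge \mathrm{ARGUMENT}(j,e)\,]
\end{align*}
```

On this sketch, nothing in the formula itself says whether ARGUMENT(j, e) is an experiencer relation or an agent relation. That choice is left to context, which is where the coercion toward an agentive reading would have to come in.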

I agree that it would be better not to posit two homophonous verbs (stative be vs. be of activity) to account for the ambiguous stative vs. eventive denotations of quotative “be like.” Doing so requires two separate analyses and two separate channels of diffusion, which seems unlikely given the rapidity with which this quotative did in fact spread across many varieties of English. However, Haddican et al.’s application of Davidsonian semantics to explain the ambiguous readings runs into a problem in sentences like (21) below, as well as in the earlier example in (13):

(21) It was like, “Oh Mom, Can I film a movie in the house, it won’t be any problem at all.”

This is clearly an eventive predication of quotative “be like.” But instead of an agentive subject we have expletive it. Recall that Haddican et al.’s analysis relies on the notion that stative be has a LOCALE function that locates the state in a temporary moment or event. This localization can coerce an experiencer subject into the role of an agentive subject when the most likely reading (as above) suggests that the temporary event was an actual speech act. As Haddican et al. say themselves, “this event assigns an agentive role to the subject” (p. 85). However, by definition, the expletive in (21) receives no theta role and can therefore be neither the experiencer of a state nor the agent of an event. And yet (21) clearly denotes an eventive reading: the speaker actually spoke the words, or something like them.

The fact that “be like” quotatives can take an eventive (or even a stative) reading when an expletive surfaces in spec-TP suggests that Davidsonian semantics does not explain the ambiguous eventive vs. stative readings associated with these quotatives. (The fact that “be like” quotatives exhibit both experiencer subjects and expletive subjects also suggests that the quote CP is the only obligatory argument selected by “be like.”)

The only alternative seems to be that there are in fact two homophonous be verbs, and quotative “be like” makes use of both. Maybe this isn’t such a big deal. If I’m right about the diachronic process by which quotative “be like” arose, then we can at least see a two-step process: quotative “be like” was solely a stative predicate in its early use and for most of its early history; only later did it begin to be used as an eventive predicate. And if there are in fact two be verbs, the eventive sounds exactly like the stative and is in fact much rarer than the stative, so I suppose one can see how these facts laid the groundwork for the eventual use of stative “be like” as an eventive predicate.