The Pareto distribution of native American language speakers

My post about native American language health gets the most hits on this blog, so I decided to do some minor editorial housekeeping on it last night. While I was fixing awkward syntax, however, I noticed something blatantly obvious about the first graph, which ranks living native languages according to most speakers:


It’s essentially a Pareto distribution, a long tail. I don’t know much about the mathematics underlying it. I only know that it arises naturally across an array of social, geographic, economic, and scientific phenomena. Derek Mueller recently wrote an article about this exact distribution amongst scholarly citations in the field of rhetoric and writing. “Conceptually,” he writes, “the long tail comes from statistics and graphing; it is a feature of a power law or Pareto distribution—graphed patterns that underscore the uneven distribution of some activity or phenomenon” (207). And yet this unequal distribution exists in phenomena as disparate as citations in academic journals and numbers of language speakers.
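The shape is easy to reproduce with a toy model. Here is a minimal sketch in Python with made-up numbers (the exponent and the top-ranked speaker count are assumptions for illustration, not values fitted to real census data):

```python
# Hypothetical illustration: if the k-th ranked language has roughly
# C * k**(-alpha) speakers, a few languages dominate and the rest
# trail off into the long tail.
alpha = 1.0        # assumed exponent; a real fit would estimate this
C = 170_000        # assumed speaker count for the top-ranked language

speakers = [int(C * k ** -alpha) for k in range(1, 11)]
print(speakers)

# Even with only ten ranks, the top few hold most of the total.
top3_share = sum(speakers[:3]) / sum(speakers)
print(f"Top 3 of 10 languages account for {top3_share:.0%} of speakers")
```

Even in this ten-rank toy, the top three languages account for well over half the speakers, which is exactly the unevenness Mueller describes.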

A power law writ deep in the mathematical fabric of things?

The language/genes metaphor (part 4)

Part IV: The basic building blocks of linguistic replication?

As I mentioned in the last post, I’m convinced that a language/phenotype analogy is more appropriate than a language/genes analogy. However, here’s a second piece of devil’s advocacy because I still believe a case can be made for the latter metaphor.

The language/genes metaphor is appropriate only if we assume that languages are the auditory manifestations of underlying linguistic structures, and that these structures replicate in the mind every time an individual learns a language, either as a first or second language. Entertaining this possibility means accepting Chomsky’s universal grammar hypothesis and his principles and parameters approach.

Does Chomsky’s approach provide some kind of mechanism whereby we can reduce linguistic structures to a few basic parts, the way we can reduce DNA and its replication to nucleotide bases and enzymes?

Yes, I think so. Phrase structure theory and X’ theory provide a framework for analyzing all human languages according to a few basic building blocks: phrases, phrase heads, complements, and specifiers. This is the “DNA” of language.


We don’t need to go into detail about this chart. Basically, all languages are built from phrase heads (X), which project to an intermediate bar level (X’), which in turn projects to a full phrase (XP). Heads can optionally take a complement, and phrases can optionally host a specifier.

Four phrase heads map onto—more or less—the grammatical categories we learn in school: verbs, nouns, prepositions, and adverbs and adjectives (both categorized as AP). The other phrase heads are less well known. Tense phrases (aka Inflection Phrases) are an abstract category that allows verbs to inflect for tense and person. Complementizer phrases are projected from embedded clauses: words such as that or because are complementizers. And determiner phrases project from what we commonly call articles: the, a, an in English.

All human languages are built from these categories. All human languages can be analyzed according to the same basic rules of phrase projection. The difference between languages is the difference between various “parameter settings” for these phrases and projections, among other aspects of language structure that I haven’t talked about here.

One of the most salient cross-linguistic differences is, of course, word order. However, according to phrase structure and X’ theory, this difference is simply a matter of re-configuring or re-arranging the structural projections:

The structure of English word order (Subject - Verb - Object)

The structure of Malagasy word order (Verb - Object - Subject)

The structure of Japanese word order (Subject - Object - Verb)

So, for example, the difference between English and Japanese is simply the difference between an X’ that projects to X first and complement second (English) and an X’ that projects to complement first and X second (Japanese). In other words, English is a head-initial language and Japanese is a head-final language.
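To make the parameter idea concrete, here is a minimal sketch (my own toy model, not standard generative machinery) in which a single head-direction flag linearizes the same X’ structure in two different orders:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class XP:
    head: str                           # X: the phrase head, e.g. a verb
    complement: Optional["XP"] = None   # sister of the head under X'
    specifier: Optional["XP"] = None    # sister of X' under XP

def linearize(xp: Optional[XP], head_initial: bool) -> list[str]:
    """Flatten an XP into a word list. The specifier precedes X' in both
    settings; the parameter only flips head vs. complement order."""
    if xp is None:
        return []
    spec = linearize(xp.specifier, head_initial)
    comp = linearize(xp.complement, head_initial)
    x_bar = [xp.head] + comp if head_initial else comp + [xp.head]
    return spec + x_bar

# "read books" as a bare VP with an NP complement
vp = XP(head="read", complement=XP(head="books"))
print(linearize(vp, head_initial=True))   # English-like: ['read', 'books']
print(linearize(vp, head_initial=False))  # Japanese-like: ['books', 'read']
```

The structure passed to both calls is identical; only the parameter setting changes the surface word order.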


In a Chomskyan framework, there are basic building blocks of linguistic structures, and the differences between languages can be described as differences in the configuration and specification of these structures. Sounds roughly comparable to DNA in my opinion, but then, I’m not a geneticist . . .

(Note: the phrase table and the examples in this post are taken from the helpful notes of Dr. John Nissenbaum.)

The language/genes metaphor (part 3)

Part III: Do linguistic structures replicate?

On the Phylogenetic Networks blog, David Morrison has posted an excellent essay regarding the (as he sees it) false analogy between languages and genes. He suggests that a more apt metaphor would be that of languages/phenotypes. He was kind enough to send me an early draft of the post. Reading it was precisely what inspired me to write a few posts of my own on this subject.

Re-reading the essay, I find myself agreeing with its point once again. If we want to find a connection between linguistic and biological evolution, we should probably take a morphological, developmental, typological approach—in short, a phenotypic approach. Phenotypes are the observable traits of an organism, the expressions of genotypes (or, more accurately, the expressions of genotypes and environmental pressures). So, phenotypes clearly offer a more grounded and potentially productive linguistic analogy than genotypes because a language is a composite of observable traits—from the phonetic level to the semantic level. A language/genes metaphor is askew; it confuses the replicators of hereditary information with the observable expression of that information.

If found to be productive, will the language/phenotype metaphor be as suggestive as the language/genes metaphor in the debate over language origins? I’m not sure. I’m inclined to say yes—pointing still toward an essentially biological view of language—because a productive language/phenotype analogy would also suggest that languages evolve in a way comparable to physical structures. But I’ll leave that question aside for now. In this post (and perhaps the next one), I still want to play devil’s advocate and entertain the possibility of a productive language/genes metaphor.

Clearly, the only way to do this is by assuming the veracity of Chomsky’s universal grammar (UG) and his principles and parameters approach. Assuming the other major linguistic theory—Halliday’s systemic functional grammar—the metaphor doesn’t work at all. It is not too misleading, however, in a Chomskyan framework, to view the phonology, morphology, and syntax of a language in a ‘genetic’ way because this framework assumes that languages are built on underlying structures that could be said to replicate themselves, in that these structures are expressed and understood by humans as language. One can view the relationship, without stretching things too much, as a two-tiered organization:


The language/structure distinction, in generative linguistics, is not as clear or even as important as the phenotype/genotype distinction in biology. But, again, we assume that languages are made possible by their underlying structures, and so it’s not entirely unfair to differentiate them for purposes of exploring this metaphor.

Do linguistic structures replicate themselves? In a sense, yes. In both first and second language acquisition research, evidence suggests that syntactic structures develop gradually. (We typically speak of grammars being ‘acquired,’ but the word denotes a structural development in the mind.) Of course, the development doesn’t begin with anything like sexual (2 parents to offspring) or even asexual (1 parent to offspring) reproduction. During first language acquisition, linguistic structures develop in the mind of a child over the course of about 2-5 years, and we might say the ‘parents’ of this linguistic development are anyone and everyone who communicates linguistically with the child, anyone who passes on structural linguistic information in the form of verbal interaction.

Take the following examples. (Examples come from Hawkins’ Second Language Syntax and Clark’s First Language Acquisition.) The first comes from an interaction between a 2 year old and his mother; the second comes from a second-language learner of English.

Regardless of language, children and adults first develop lexical categories—nouns, verbs, and adjectives. In (1), the child is acquiring the lexeme mouse and is in the beginning stages of acquiring its plural form, mice. In (2), the adult has acquired basic English negation. At these early stages of development, the language has not “replicated” itself completely: the underlying structures remain unspecified.

Only with time do the structures become specified: as more information is developed in the new speaker’s mind, the structures become more robust and slowly begin to replicate the fully functional language. For example, compare (2a) with (3) below, She isn’t old. (3) represents a fully developed—a fully specified—inflectional structure. The speaker is no longer working with a NegP and just attaching lexical items to it, as in (2a), She no old; instead, he has developed the category Tense, in conjunction with the ability to inflect and transform verbs. In short, the difference between (2a) and (3) is the difference between a language in the process of replication and a language that has been fully replicated in a speaker’s mind.

The lexicon of a language contains its lexical categories, individual words with specific conceptual content. Lexicons vary from language to language—dog, der Hund, el perro. Lexical items are often the first things to develop during first or second language acquisition. The lexicon interacts with the language’s morphology to create richer semantic meaning—add –s to pluralize in English, add –en to pluralize in German; add –ed to make an English verb past tense. Morphology develops gradually along with a language’s syntax, which connects all these pieces together to form coherent utterances, i.e., sentences.

A lot of information resides in any language; it is acquired piece by piece, starting with simple words and ending with fully formed syntax. The information resides in the minds of advanced speakers, and is transferred to new speakers through verbal interaction. Every time you speak with a child or non-native speaker, you are, in a sense, pollinating their mind with linguistic information that will eventually develop into a fully structured and specified language.

If languages do replicate in a way broadly similar to genes, then we would expect linguistic structures, like genes, to have some kind of uniform replication process. And, in many instances, we do find that linguistic structures replicate, or develop, along similar lines across different languages. (Obviously, most studies of language acquisition have been carried out with the world’s major Romance, Germanic, and Sinitic languages, so we don’t have anything like a complete picture of how all the world’s languages develop, but the evidence we do have suggests an organized process.)

For example, as discussed in Hawkins, several studies have shown that English verb phrases and noun phrases develop in a similar fashion across speakers, starting with bare VPs and NPs (unspecified lexical categories that simply attach to other lexical categories), moving through comparable levels of specification (e.g., the subsequent acquisition of the definite article the and copula be, the least specified items*), and ending with third-person singular –s and possessive –‘s.

*The and copula be are not highly specified because both select for just about anything they want. The can select a singular noun (the girl) or a plural noun (the girls) regardless of the noun’s initial phoneme; in contrast, a/an can select only a singular noun, and its form depends on the noun’s initial sound (a girl, an apple), so the develops before a. Likewise, copula be selects for a noun (is the man), an adjective (is happy), or a preposition (is under the table); in contrast, auxiliary be only selects for a verb ending in –ing (is running), so copula be develops before auxiliary be.
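The footnote’s idea can be restated as a toy model (my own formalization, not Hawkins’): represent each item by the set of categories it selects for, and predict that the more permissive—less specified—item develops first:

```python
# Toy model: "specification" as selectional restriction. An item that
# selects a wider range of categories is less specified, and the
# prediction is that less specified items are acquired earlier.
SELECTS = {
    "the":       {"N_sg", "N_pl"},   # the girl / the girls
    "a":         {"N_sg"},           # a girl (singular nouns only)
    "copula_be": {"N", "A", "P"},    # is the man / is happy / is under the table
    "aux_be":    {"V_ing"},          # is running
}

def acquired_earlier(x: str, y: str) -> bool:
    """Predict that x develops before y if x selects more category types."""
    return len(SELECTS[x]) > len(SELECTS[y])

print(acquired_earlier("the", "a"))             # True: the before a
print(acquired_earlier("copula_be", "aux_be"))  # True: copula before auxiliary
```

The category labels are my own shorthand; the point is only that a single measure (size of the selection set) reproduces both orderings the footnote describes.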

More in the next post . . . I’ll end by noting that I’m only partially convinced by my own argument here about the replication of linguistic structures. As I said at the beginning, I still find the language/phenotype analogy more appropriate.