Cosine similarity parameters: tf-idf or Boolean?

In a previous post, I used cosine similarity (a “vector space model”) to compare spoken vs. written States of the Union. In this post, I want to see whether and to what extent different metrics entered into the vectors—either a Boolean entry or a tf-idf score—change the results.

First, here’s a brief recap of cosine similarity: One way to quantify the similarity between two texts is to turn them into a term-document matrix, with each row representing one of the texts and each column representing one word that appears in either of the texts. (The matrix will be “sparse” because each text contains only some of the words across both texts.) With this matrix in hand, it is a straightforward mathematical operation to treat its rows as vectors in Euclidean space and calculate their cosine similarity with the Euclidean dot product formula. Because no entry in the vectors is negative, the formula returns a metric between 0 and 1, where 0 = no words shared and 1 = exact copies of the same text.
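For readers who want to see the arithmetic, here is a minimal sketch of that calculation in plain Python. It is an illustration only, not the script used for the State of the Union comparisons: the function names `term_vectors` and `cosine_similarity`, the whitespace tokenization, and the two toy sentences are all assumptions made for the example.

```python
import math
from collections import Counter

def term_vectors(text_a, text_b):
    """Build two count vectors over the combined vocabulary of both texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # One column per word that appears in either text.
    vocab = sorted(set(counts_a) | set(counts_b))
    vec_a = [counts_a[word] for word in vocab]
    vec_b = [counts_b[word] for word in vocab]
    return vocab, vec_a, vec_b

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty text shares nothing with anything
    return dot / (norm_u * norm_v)

# Toy sentences, invented for illustration.
vocab, a, b = term_vectors("the state of the union", "the union is strong")
print(vocab)
print(a, b)
print(round(cosine_similarity(a, b), 3))
```

Note that this sketch fills the vectors with raw word counts, which is only one of several possible choices.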

. . . But what exactly goes into the vectors in these matrices? Not words from the two texts under comparison, obviously, but numeric representations of the words. The problem is that there are different ways to represent words as numbers, and it’s never clear which is the best way. When it comes to vector space modeling, I have seen two common methods:

The Boolean method: if a word appears in a text, it is represented simply as a 1 in the vector; if a word does not appear in a text, it is represented as a 0.

The tf-idf method: if a word appears in a text, its term frequency-inverse document frequency is calculated, and that score appears in the vector; if a word does not appear in a text, it is represented as a 0.
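The two representations can be sketched side by side. This is a hedged illustration rather than a definitive implementation: tf-idf comes in several variants, and the one below (raw term count multiplied by a smoothed idf of ln((1 + n) / (1 + df)) + 1) is simply one common choice. The helper names `boolean_vector` and `tfidf_vectors` and the two toy documents are hypothetical.

```python
import math
from collections import Counter

def boolean_vector(tokens, vocab):
    """1 if the word occurs in the text at all, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]

def tfidf_vectors(docs, vocab):
    """Raw term count times a smoothed inverse document frequency."""
    n_docs = len(docs)
    doc_sets = [set(tokens) for tokens in docs]
    idf = {}
    for word in vocab:
        # df = number of documents that contain the word
        df = sum(1 for s in doc_sets if word in s)
        idf[word] = math.log((1 + n_docs) / (1 + df)) + 1.0
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        vectors.append([counts[word] * idf[word] for word in vocab])
    return vectors

# Toy documents, invented for illustration.
docs = [text.split() for text in ("we the people", "the people the union")]
vocab = sorted({word for tokens in docs for word in tokens})

bool_vecs = [boolean_vector(tokens, vocab) for tokens in docs]
tfidf_vecs = tfidf_vectors(docs, vocab)
print(vocab)
print(bool_vecs)
print(tfidf_vecs)
```

In the Boolean vectors, the repeated "the" in the second document counts once; in the tf-idf vectors it counts twice, while words unique to one document get a larger idf weight than words shared by both.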

In my previous post, I used this Python script (compliments to Dennis Muhlstein) which uses the Boolean method. Tf-idf scores control for document length, which is important sometimes, but I wasn’t sure if I wanted to ignore length when analyzing the States of the Union—after all, if a change in medium induces a change in a speech’s length, that’s a modification I’d like my metrics to take note of.

But how different would the results be if I had used tf-idf scores in the term-document matrices, that is, if I had controlled for document length when comparing written vs. spoken States of the Union?

Using Scikit-learn’s TfidfVectorizer and its cosine similarity function (part of the pairwise metrics module), I again calculated the cosine similarity of the written and spoken addresses, but this time using tf-idf scores in the vectors.
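A minimal sketch of how such a comparison might look with scikit-learn follows. It is not the exact code behind the graph: the helper `pair_similarity` is hypothetical, the texts are toy strings, and fitting the vectorizer on only the two texts being compared is an assumption (fitting it on a whole corpus would give different idf weights). `CountVectorizer(binary=True)` stands in for the Boolean method.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pair_similarity(written, spoken, vectorizer):
    """Vectorize two texts together and return their cosine similarity."""
    # Rows are the two texts; columns are the words found in either one.
    matrix = vectorizer.fit_transform([written, spoken])
    # cosine_similarity returns a 2x2 matrix; [0, 1] is text 0 vs. text 1.
    return float(cosine_similarity(matrix)[0, 1])

# Toy stand-ins for a written and a spoken address.
written = "the state of the union is strong and the economy is growing"
spoken = "my fellow americans the state of our union is strong"

boolean_score = pair_similarity(written, spoken, CountVectorizer(binary=True))
tfidf_score = pair_similarity(written, spoken, TfidfVectorizer())
print("Boolean:", round(boolean_score, 2))
print("tf-idf: ", round(tfidf_score, 2))
```

Passing the vectorizer in as an argument keeps everything else identical between the two runs, so any difference in the scores comes from the choice of vector entries alone.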

The results of both methods—Boolean and tf-idf—are graphed below.


I graphed the (blue) tf-idf measurements first, in decreasing order, beginning with the most similar pair (Nixon’s 1973 written/spoken addresses) and ending with the most dissimilar pair (Eisenhower’s 1956 addresses). Then I graphed the Boolean measurements following the same order. I ended each line with a comparison of all spoken and all written States of the Union (1790 – 2015) copied chronologically into two master files.

In general, both methods capture the same trend, though with slightly different numbers attached to it. In a few cases, the discrepancies seem major: With tf-idf scores, Nixon’s 1973 addresses returned a cosine similarity metric of 0.83; with Boolean entries, the same addresses returned a cosine similarity metric of 0.62. And when comparing all written/spoken addresses, the tf-idf method returned a similarity metric of 0.75; the Boolean method returned a metric of only 0.55.

So, even though both methods capture the same general trend, the tf-idf scores suggest that the spoken/written pairs are more similar to each other than the Boolean entries do. These divergent results might warrant slightly different analyses and conclusions: not wildly different, of course, but different enough to matter. So which results most accurately reflect the textual reality?

Well, that depends on what kind of textual reality we’re trying to model. Controlling for length obviously makes the texts appear more similar, so the right question to ask is whether or not we think length is a disposable feature, a feature producing more noise than signal. I’m inclined to think length is important when comparing written vs. spoken States of the Union, so I’d be inclined to use the Boolean results.

Either way, my habit at the moment is to make parameter adjustment part of the fun of data analysis, rather than relying on the default parameters or on whatever parameters all the really smart people tend to use. The smart people aren’t always pursuing the same questions that I’m pursuing as a humanist who studies rhetoric.


Another issue raised by this method comparison is the nature of the cosine similarity metric itself. 0 = no words shared, 1 = exact copies of the same text, but that leaves a hell of a lot of middle ground. What can I say, ultimately, and from a humanist perspective, about the fact that Nixon’s 1973 addresses have a cosine similarity of 0.83 while Eisenhower’s 1956 addresses have a cosine similarity of 0.48?

A few days ago I found, subsequently lost, and now cannot re-find a Quora thread discussing common-sense methods for interpreting cosine similarity scores. All the answers recommended using benchmarks: finding texts from the same or a similar genre as the texts under comparison that are commonly accepted to be exceedingly different or exceedingly similar (asking a small group of readers to come up with these judgments can be a good idea here). So, for example, if using this method on 19th-century novels in English, a good place to start would be to measure, say, Moby Dick and Pride and Prejudice, two novels that a priori we can be absolutely sure represent wildly different specimens from a semantic and stylistic standpoint.

And indeed, the cosine similarity of Melville’s and Austen’s novels is only 0.24. There’s a dissimilarity benchmark set. At the similarity end, we might compute the cosine similarity of, say, Pride and Prejudice and Sense and Sensibility.

Given that my interest in the State of the Union corpus has more to do with mode of delivery than individual presidential style, I’m not sure how to go about setting benchmarks for understanding (as a humanist) my cosine similarity results—I’m hesitant to use the “similarity cline” apparent in the graph above because that cline is exactly what I’m trying to understand.


