Thursday, April 24, 2014
The Classic Blunders
I was thinking today about The Classic Blunders, as described by Vizzini.
1. Turnover isn't free. So much knowledge never gets written down; it's domain knowledge or experience that exists only in the mind of one person. When that person walks out the door, all that know-how goes too. (And that's just the smallest cost. Pour encourager les autres, "to encourage the others", is a great way to utterly demoralize les autres.)
2. Prickliness isn't free. Dealing with prickly people, day in and day out, wears on the soul.
3. The Mission Isn't Everything. When The Mission Becomes Everything, it means people are willing to sacrifice all the essential parts of a reasonably happy and healthy life - like health, family, a sense of balance - In Order To Make The Mission Happen.
Saturday, March 15, 2014
A tale of two distances (part 2)
Last time, I wrote about using cosine similarity as a way to tell how alike or different two trigram language models were. This post describes a second distance-like measure, from Cavnar and Trenkle's 1994 paper, "N-Gram-Based Text Categorization", which can be found on the web. The way Cavnar and Trenkle (C&T) compute the distance between models is not via cosine similarity, but by using the ranks of the sorted trigram frequencies.
Here's a simple example of C&T's method: suppose we have two lists of trigrams, and we want a distance measure that tells us how alike the lists are.
list X: RAN CAT CAT THE THE THE RED
list Y: THE CAT RAN RED RAN CAT RED
We compute the frequencies of the trigrams in each list, sort each list in descending frequency, rank the sorted lists, and compute the sum of the absolute differences of the ranks. Worked out:
X freqs:                 RAN=1  CAT=2  THE=3  RED=1
X sorted down by freq:   THE  CAT  RAN  RED
X ranks:                   1    2    3    4

Y freqs:                 THE=1  CAT=2  RAN=2  RED=2
Y sorted down by freq:   CAT  RAN  RED  THE
Y ranks:                   1    2    3    4

rank diffs:              THE(1-4)  CAT(2-1)  RAN(3-2)  RED(4-3)
absolute diffs:          3  1  1  1
sum of abs diffs:        3+1+1+1 = 6
Here, X and Y are lists of trigrams. We compute the differences between the ranks of the lists' elements, where each list is sorted in descending order by frequency, then take the absolute values of these differences and sum them. The resulting sum (6 in this example) is a measure of the mismatch between the two lists.
Cavnar and Trenkle call this measure of mismatch an "out-of-place measure". In this posting, I call the measure of mismatch the rank distance between the two lists.
When an element is in both lists, the magnitude of its rank difference tells us how far out of place that element is. When an element is not in both lists, we arbitrarily set its rank difference to N (the length of the lists). The rank distance therefore varies from 0 (both lists have the same elements with the same frequency rankings) to N*N (the two lists share no common elements).
For two trigram models (from a reference text and a sample text), we can use the Cavnar & Trenkle out-of-place measure as a distance-like way to test the similarity between two models -- lower rank distance implies greater similarity.
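Here's a minimal Python sketch of that rank-distance computation. This is my own illustration, not the C&T code or the Python Recipes code; the function names (ranks, rank_distance) and the exact penalty charged for missing elements are my choices:

from collections import Counter

def ranks(trigrams):
    """Map each trigram to its rank (1 = most frequent); ties keep first-seen order."""
    counts = Counter(trigrams)
    return {t: i + 1 for i, (t, _) in enumerate(counts.most_common())}

def rank_distance(trigrams_a, trigrams_b):
    """Out-of-place measure: sum of absolute rank differences between the two lists."""
    ra, rb = ranks(trigrams_a), ranks(trigrams_b)
    penalty = max(len(ra), len(rb))   # rank difference charged to an element missing from one list
    total = 0
    for t in set(ra) | set(rb):
        total += abs(ra[t] - rb[t]) if (t in ra and t in rb) else penalty
    return total

X = "RAN CAT CAT THE THE THE RED".split()
Y = "THE CAT RAN RED RAN CAT RED".split()
print(rank_distance(X, Y))   # prints 6, matching the worked example above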
The process described above is not exactly what Cavnar & Trenkle proposed. C&T collect frequencies for all N-grams (N = 1, 2, 3, 4, 5), not just trigrams, and take the top 400 most-frequent N-grams, in descending frequency order, as their language models. I've tested a simplified variant of C&T (use only trigrams, and take the top 300 as a language model) against the Python Recipes code, and the quality of the classification results is similar when used just for language detection.
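For concreteness, here's a sketch of that simplified variant: build a truncated trigram profile for a text, then compare two profiles with the out-of-place measure. Again, the function names and the normalization choices are mine, not C&T's or the recipe's:

from collections import Counter

def trigram_profile(text, top_k=300):
    """Top top_k character trigrams of text, in descending frequency order."""
    text = " ".join(text.lower().split())    # crude normalization of case and whitespace
    counts = Counter(text[i:i+3] for i in range(len(text) - 2))
    return [t for t, _ in counts.most_common(top_k)]

def profile_distance(profile_a, profile_b):
    """Out-of-place measure between two ranked profiles (lists of distinct trigrams)."""
    pos_b = {t: i for i, t in enumerate(profile_b)}
    penalty = max(len(profile_a), len(profile_b))
    return sum(abs(i - pos_b[t]) if t in pos_b else penalty
               for i, t in enumerate(profile_a))

To detect a language, build one profile per reference text, build a profile for the sample, and pick the reference with the smallest profile_distance.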
For reference, the C&T 1994 paper is available from
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.162.2994
Friday, March 14, 2014
A tale of two distances
Suppose you have a blob of foreign text, and you want to identify what language the text is written in. At the Python Recipes web site, the URL
http://code.activestate.com/recipes/326576/
gives Python2 code for doing this. The idea is simple.
For each language you want to detect, you build a reference model for that language:
- get a reasonably large sample of text in this language.
- from this sample, make a frequency table of the 3-grams in the text.
- save this frequency table as a model for this language.
Then, to identify the language of a sample text:
- make a frequency table of the 3-grams in the sample (a new language model)
- compute the distance between this model and those of all the reference models you already have.
- classify this sample text as the language of the reference model with the least distance.
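As a concrete (and simplified) sketch of those two steps, here is roughly what model building and classification could look like. This is my own illustration, not the recipe's code; build_model, classify, and the distance parameter are names I've made up:

from collections import Counter

def build_model(text):
    """Frequency table of the character 3-grams in text (a simple language model)."""
    return Counter(text[i:i+3] for i in range(len(text) - 2))

def classify(sample_text, reference_models, distance):
    """Return the language whose reference model is closest to the sample's model."""
    sample_model = build_model(sample_text)
    return min(reference_models,
               key=lambda lang: distance(sample_model, reference_models[lang]))

Here reference_models is a dict mapping language names to saved models, and distance is whatever model-to-model distance (or difference) we settle on below.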
But what should "distance" between two language models mean? In high school, we were introduced to Euclidean distance - a way of measuring how far two points in the (x,y) plane are from each other:
distance(x,y) = sqrt((x1-y1)**2 + (x2-y2)**2) where x = (x1,x2) and y = (y1,y2)
So our garden-variety idea of distance is a function that takes two points (two ordered pairs), and returns a nonnegative number. Math people call a function like this a metric (or distance function, or just a distance) if the function has the following properties:
- coincidence. d(x,y) = 0 if and only if x = y.
- symmetry. d(x,y) = d(y,x).
- triangle inequality. d(x,z) <= d(x,y) + d(y,z).
The Python recipe doesn't compute a Euclidean distance between models. Instead, it compares two trigram models A and B with something like a cosine similarity, in pseudocode:

similarity(A, B) =
    total = 0
    for each trigram t in both A and B:
        fa = A[t]   (frequency of trigram t in model A)
        fb = B[t]   (frequency of trigram t in model B)
        total += fa * fb
    return total / (scalar_length(A) * scalar_length(B))

and

scalar_length(X) =
    return the square root of the sum of squares of all trigram frequencies in X.
The Python code implements roughly the pseudocode above; the trigram frequency tables in the Python code are stored as dictionaries of 2-grams whose values are (dictionaries of 1-grams whose values are frequencies).
Rather than provide a distance, the Python code defines a difference between models, defined as
difference(A, B) = 1 - similarity(A,B)
This difference function acts vaguely like a distance, but it is not one in the precise mathematical sense (see the Wikipedia article on "cosine similarity" for the reasons why; for instance, two different models whose trigram frequencies are proportional have a difference of 0, which violates the coincidence property). In practice, though, the difference function acts the way we'd like a distance to act: two trigram-identical language samples yield a difference of 0, and two entirely dissimilar samples yield a difference of 1.
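Here's a small runnable sketch of the similarity and difference computations. It uses a flat dict mapping trigrams to counts rather than the recipe's nested dictionaries, so it illustrates the idea rather than reproducing the recipe's actual code:

from math import sqrt

def scalar_length(model):
    """Square root of the sum of squares of all trigram frequencies in the model."""
    return sqrt(sum(f * f for f in model.values()))

def similarity(a, b):
    """Cosine-style similarity between two trigram frequency tables (dicts)."""
    dot = sum(freq * b[t] for t, freq in a.items() if t in b)
    return dot / (scalar_length(a) * scalar_length(b))

def difference(a, b):
    return 1.0 - similarity(a, b)

Combined with the build_model and classify sketch above, classify(sample_text, reference_models, difference) picks the reference language with the smallest difference.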
Next time, I'll describe a different method for calculating a distance between two models, based on a nonparametric approach called rank distance.