http://code.activestate.com/recipes/326576/
gives Python 2 code for doing this. The idea is simple.
For each language you want to detect, you build a reference model for that language:
- get a reasonably large sample of text in this language.
- from this sample, make a frequency table of the 3-grams in the text.
- save this frequency table as a model for this language.
Then, to identify the language of a new sample of text (see the sketch after this list):
- make a frequency table of the 3-grams in the sample (a new language model).
- compute the distance between this model and those of all the reference models you already have.
- classify this sample text as the language of the reference model with the least distance.
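To make these steps concrete, here is a minimal sketch in Python (my own illustration, not the recipe's code; the helper names trigram_model and classify, and the lowercasing and space-padding, are assumptions):

import collections

def trigram_model(text):
    # Frequency table of the character 3-grams in text, lowercased and padded
    # with spaces so that word boundaries contribute trigrams as well.
    text = " " + text.lower() + " "
    return collections.Counter(text[i:i + 3] for i in range(len(text) - 2))

def classify(sample, reference_models, difference):
    # reference_models maps a language name to its trigram frequency table;
    # difference is a function of two models, such as the one described below.
    sample_model = trigram_model(sample)
    return min(reference_models,
               key=lambda language: difference(sample_model, reference_models[language]))

The interesting part is what difference(model1, model2) should compute, and that is where the idea of distance comes in.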
In high school, we were introduced to Euclidean distance - a way of measuring how far two points in the (x,y) plane are from each other:
distance(x,y) = sqrt((x1-y1)**2 + (x2-y2)**2) where x = (x1,x2) and y = (y1,y2)
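For example, the distance between the points (0, 0) and (3, 4) is sqrt((3-0)**2 + (4-0)**2) = sqrt(25) = 5.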
So our garden-variety idea of distance is a function that takes two points (two ordered pairs), and returns a nonnegative number. Math people call a function like this a metric (or distance function, or just a distance) if the function has the following properties:
- coincidence. d(x,y) = 0 if and only if x = y.
- symmetry. d(x,y) = d(y,x).
- triangle inequality. d(x,z) <= d(x,y) + d(y,z).
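Euclidean distance, for example, satisfies all three of these properties.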
The recipe scores how alike two models A and B are; the score is, in effect, the cosine similarity of the two frequency tables. In pseudocode:

similarity(A, B) =
    total = 0
    for each trigram t that appears in both A and B:
        fa = A[t]    (frequency of trigram t in model A)
        fb = B[t]    (frequency of trigram t in model B)
        total += fa * fb
    return total / (scalar_length(A) * scalar_length(B))

and

scalar_length(X) =
    return the square root of the sum of the squares of all trigram frequencies in X.
The Python code implements roughly the pseudocode above; it stores each trigram frequency table as a dictionary keyed by 2-grams, whose values are dictionaries keyed by 1-grams, whose values are frequencies.
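To keep the illustration simple, the sketch below stores each model as one flat dictionary mapping a trigram to its frequency (for example, the Counter built by the hypothetical trigram_model above) rather than the recipe's nested layout; it is a simplification, not the recipe's actual code:

import math

def scalar_length(model):
    # The model viewed as a vector of trigram frequencies; this is its length.
    return math.sqrt(sum(freq * freq for freq in model.values()))

def similarity(a, b):
    # Cosine similarity: the dot product over the trigrams the two models share,
    # divided by the product of their lengths.
    shared = set(a) & set(b)
    total = sum(a[t] * b[t] for t in shared)
    return total / (scalar_length(a) * scalar_length(b))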
Rather than providing a distance, the Python code defines a difference between models:
difference(A, B) = 1 - similarity(A,B)
This difference function acts vaguely like a distance, but it is not one in the precise mathematical sense (see the Wikipedia article on "cosine similarity" for the reasons why). In practice, though, it behaves the way we would like a distance to behave: two trigram-identical language samples yield a difference of 0, and two entirely dissimilar samples yield a difference of 1.
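Continuing the same hypothetical sketch, the difference is just one minus the similarity, and a toy check shows the two extremes:

def difference(a, b):
    # 0 for trigram-identical models, 1 for models that share no trigrams at all.
    return 1.0 - similarity(a, b)

english = trigram_model("the quick brown fox jumps over the lazy dog")
print(difference(english, english))                  # ~0.0 (up to rounding)
print(difference(english, trigram_model("zzzz")))    # 1.0: no trigrams in common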
Next time, I'll describe a different method for calculating a distance between two models, based on a nonparametric approach called rank distance.