Abstract |
: |
This work presents a new perspective on characterizing the similarity
between elements of a database or, more generally, nodes of a weighted
and undirected graph. It is based on a Markov-chain model of random
walk through the database. More precisely, we compute quantities (the
average commute time, the pseudoinverse of the Laplacian matrix
of the graph, etc) that provide similarities between any pair of nodes,
having the nice property of increasing when the number of paths connect-
ing those elements increases and when the “length” of paths decreases.
It turns out that the square root of the average commute time is a Eu-
clidean distance and that the pseudoinverse of the Laplacian matrix is a
kernel (its elements are inner products closely related to commute times).
A procedure for computing the subspace projection of the node vectors
of the graph that preserves as much variance as possible in terms of the
commute-time distance – a principal component analysis (PCA) of the
graph – is also introduced. This graph PCA provides a nice interpreta-
tion to the “Fiedler vector”, widely used for graph partitioning. The
model is evaluated on a collaborative-recommendation task where sug-
gestions are made about which movies people should watch based upon
what they watched in the past. Experimental results on the MovieLens
database show that the Laplacian-based similarities (the pseudoinverse of
the Laplacian matrix and the “random-forest matrix”) perform well in
comparison with other methods. The model, which nicely fits into the
so-called “statistical relational learning” framework, could also be usedto compute document or word similarities, and, more generally, it could be applied to machine-learning and pattern-recognition tasks involving a database. |