Thus, the cosine of the angle θ between two vectors x and y is given by:
\[ \cos\theta = \frac{x \cdot y}{\|x\| \, \|y\|} \]
Document vectors whose cosine with the query vector exceeds a threshold (typically
0.5) are "retrieved." For the above query, documents d1 and d2 would be
deemed similar (with d2 being twice as similar).
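
The thresholded retrieval described above can be sketched in a few lines of Python. This is only an illustration: the function names (cosine, retrieve), the 4-term vocabulary, and the toy vectors are assumptions made for the example, not the documents d1, d2, d3 of these notes.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_u * norm_v)

def retrieve(query, docs, threshold=0.5):
    """Return (name, score) pairs whose cosine exceeds the threshold, best first."""
    scored = [(name, cosine(query, vec)) for name, vec in docs.items()]
    hits = [(name, score) for name, score in scored if score > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical term-count vectors over a 4-term vocabulary.
toy_docs = {
    "a": [1, 1, 0, 0],
    "b": [2, 0, 1, 0],
    "c": [0, 0, 0, 3],
}
toy_query = [1, 0, 0, 0]
results = retrieve(toy_query, toy_docs)
```

With these toy vectors, document "c" shares no term with the query and is dropped, while "b" ranks above "a" because its vector points more nearly in the query's direction.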
Some points about this model:
(i) Notice that while the dot-product calculation looks cumbersome,
very few positions are non-zero in both vectors, and only these positions
contribute to the dot product. Still, the computation can become expensive
in high dimensions.
(ii) One simplification that people resort to is to normalize
all document vectors (to have length one), so that the cosine reduces
to a plain dot product.
(iii) Retrieval is not "exact" (which is desirable). For example, a
document does not need to contain all the terms in the query to "qualify." However,
the vector-space model does not model term-term correlations, so it will
not retrieve documents that are related to the query only indirectly, through
an intermediate term. Thus, d3 will not be returned
for the query, even though "data structures" has something to do with "theory".
Simple fixes to this problem are term-weighting approaches and the use of
a thesaurus (sometimes called a lexicon) that explicitly models such
correlations. Still, the fundamental problem remains
that in the vector-space model, terms are assumed to be independent, forming
an orthogonal basis that spans the space.
(iv) Finally, the query vector must be compared with every
document vector to ensure that no relevant document is left behind (i.e.,
there is no pre-processing, akin to a lazy approach).
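
Points (i), (ii) and (iv) can be illustrated together with a small sketch. It assumes that each vector is stored sparsely as a Python dict from term to weight; the helper names (normalize, sparse_dot, scan) and the toy documents are hypothetical and chosen only for this example.

```python
import math

def normalize(vec):
    """Scale a sparse vector (term -> weight dict) to unit length."""
    length = math.sqrt(sum(w * w for w in vec.values()))
    if length == 0:
        return dict(vec)
    return {term: w / length for term, w in vec.items()}

def sparse_dot(u, v):
    """Dot product that only visits terms present in both vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller dict
    return sum(w * v[term] for term, w in u.items() if term in v)

def scan(query, unit_docs, threshold=0.5):
    """Lazy retrieval: the query is compared against every stored document."""
    q = normalize(query)
    hits = {}
    for name, doc in unit_docs.items():
        score = sparse_dot(q, doc)  # equals the cosine, since both have length one
        if score > threshold:
            hits[name] = score
    return hits

# Hypothetical documents, normalized once when they are stored.
toy_docs = {
    "x": normalize({"graph": 1, "theory": 1}),
    "y": normalize({"data": 2, "structures": 2}),
}
hits = scan({"theory": 1}, toy_docs)
```

Because the documents are normalized when stored, each comparison costs one sparse dot product, but scan still touches every document. Document "y" is not returned here because it shares no term with the query, which is the limitation noted in point (iii).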
Or, we can think of this as:
\[ A = S \Lambda S^{-1} \]
where \Lambda denotes the diagonal matrix of eigenvalues.
The matrix A is said to be diagonalized. Diagonalization can be thought of as a neat mathematical way to bring out the structure of a matrix. S is sometimes called the eigenvector matrix, and the diagonal matrix of eigenvalues is sometimes called the eigenvalue matrix. Diagonalization is also a way of showing the rank of the matrix explicitly: it is the number of non-zero diagonal elements in the middle matrix. We are thus introduced to our first decomposition.
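
As a small numerical check, the sketch below rebuilds a matrix from its eigenvector matrix S and its eigenvalues, assuming the decomposition is written in the standard form A = S Λ S^{-1} with the diagonal eigenvalue matrix in the middle. The 2x2 example and the helper names (matmul, diag, rank_from_eigenvalues) are hypothetical; the inverse of S is written out by hand rather than computed.

```python
def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def diag(values):
    """Square matrix with the given values on the diagonal, zeros elsewhere."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def rank_from_eigenvalues(values, tol=1e-12):
    """Rank of a diagonalizable matrix: count of non-zero eigenvalues."""
    return sum(1 for v in values if abs(v) > tol)

# Hypothetical example: the columns of S are the eigenvectors (1, 1) and (1, -1).
S = [[1.0, 1.0], [1.0, -1.0]]
S_inv = [[0.5, 0.5], [0.5, -0.5]]   # inverse of S, worked out by hand
eigenvalues = [2.0, 0.0]

A = matmul(matmul(S, diag(eigenvalues)), S_inv)   # [[1.0, 1.0], [1.0, 1.0]]
rank = rank_from_eigenvalues(eigenvalues)         # 1
```

The rebuilt A has two identical rows, so its rank is one, which matches the single non-zero entry on the diagonal of the middle matrix.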