Vector space model:
The representation of a set of documents as vectors in a common vector space is known as the vector space model. It is fundamental to a host of information retrieval operations, ranging from scoring documents against a query to document classification and document clustering.
The set of documents in a collection may then be viewed as a set of vectors in a vector space, in which there is one axis for each term.
Cosine similarity:
To quantify the similarity between two documents in the vector space, there are different approaches. The first is to take the magnitude of the difference between the document vectors. This approach has a drawback: if one document is much larger than the other, the difference will be large even when the two have similar content.
To overcome this drawback, the second approach is to compute the cosine similarity, which compensates for document length.
The cosine of the angle between two vectors can be derived from the Euclidean dot product formula:
a · b = ||a|| ||b|| cos θ
cos_sim(d1, d2) = v(d1) · v(d2) / (||v(d1)|| ||v(d2)||)
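A minimal sketch of this formula in Python (the function name and the representation of documents as plain lists of term weights are illustrative choices, not part of the original):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))          # a · b
    norm_a = math.sqrt(sum(x * x for x in a))       # ||a||
    norm_b = math.sqrt(sum(x * x for x in b))       # ||b||
    return dot / (norm_a * norm_b)
```

Dividing the dot product by both vector norms is exactly the length normalization described above: scaling either document by a constant leaves the score unchanged.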
For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. The cosine similarity can be seen as a method of normalizing document length during comparison.
In the case of information retrieval, the cosine similarity of two documents ranges from 0 to 1, since term frequencies (and hence tf-idf weights) cannot be negative.
Let's consider the following example:
Document1
- Gilbert: 3
- Hurricane: 2
- Rains: 1
- Storm: 2
- Winds: 2
Document2
- Gilbert: 2
- Hurricane: 1
- Rains: 0
- Storm: 1
- Winds: 2
We want to know how similar these documents are, purely in terms of word counts (and ignoring word order).
The two vectors are, again:
a: [3, 2, 1, 2, 2]
b: [2, 1, 0, 1, 2]
The cosine of the angle between them is about 0.9439.
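The arithmetic behind that number can be checked step by step in Python:

```python
import math

a = [3, 2, 1, 2, 2]   # Document1: Gilbert, Hurricane, Rains, Storm, Winds
b = [2, 1, 0, 1, 2]   # Document2, over the same term order

dot = sum(x * y for x, y in zip(a, b))      # 3*2 + 2*1 + 1*0 + 2*1 + 2*2 = 14
norm_a = math.sqrt(sum(x * x for x in a))   # sqrt(22) ≈ 4.690
norm_b = math.sqrt(sum(x * x for x in b))   # sqrt(10) ≈ 3.162

print(round(dot / (norm_a * norm_b), 4))    # 0.9439
```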
By measuring the angle between the vectors, we get a good idea of their similarity, and by taking the cosine of this angle we obtain a convenient value between 0 and 1 (or -1 and 1, depending on what we account for and how) that indicates this similarity. The smaller the angle, the larger (closer to 1) the cosine value, and the greater the similarity. This gives us a useful similarity metric: higher values mean more similar, lower values mean less similar. Therefore, if we compute the cosine similarity between the query vector and all the document vectors, sort the results in descending order, and select the documents with the top similarity, we obtain an ordered list of documents relevant to the query.
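The ranking procedure just described can be sketched as follows. The corpus, document ids, and query vector here are made up for illustration; the vectors are raw term frequencies over a fixed shared vocabulary:

```python
import math

def cosine(a, b):
    """Cosine similarity of two term-frequency vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical corpus: document id -> term-frequency vector
docs = {
    "d1": [3, 2, 1, 2, 2],
    "d2": [2, 1, 0, 1, 2],
    "d3": [0, 0, 5, 0, 1],
}
query = [1, 1, 0, 0, 1]  # query vector over the same vocabulary

# Score every document against the query, then sort in descending order
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # most relevant document first
```

Real systems would build these vectors from tf-idf weights and use an inverted index rather than scoring every document, but the ranking principle is the same.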