Saturday, May 26, 2007

Vector Space Model

Saturdays are dull in office. Not interested in doing the regular work, i decided to do little bit research. Searching for a specific algorithm, i stumbled upon the good old "vector based model" for search. Thought about brushing up my knowledge on the same and write a comprehensible writeup on it.
In this model the frequency of words is the catch. In the Salton's vector based model both local and global information is considered.

Term Weight = Wi = tFi * log (D/dFi)

Here,
* tFi = term frequency (term counts) or number of times a term i occurs in a document. This is the local (within a document) information.
* dFi = document frequency or number of documents containing term i
* D = number of documents in a database.
* i = the term
dFi/D = is the probability of finding term i in total documents D. Therefore, D/dFi is also called as the "inverse document frequency" or how many times the term i appears in the entire database.

Ok, now lets go through a simple example. Have referred the same from an example found in an online paper.

Lets say, there are 3 documents (D=3) with the given data,

D1 = "the Lotus is in the pond"
D2 = "Garden has a pond"
D3 = "Lotus is a flower in the center"
The query Q = "Lotus Garden Flower"

To start, we must do the following,
  1. Take out all unique words from the three documents and sort them.
  2. Determine how many times each word appears in the respective documents. For eg, here, "Garden" appears 1 time in D2, "Lotus" appears 1 time each in D1 and D3, "the" appears 2 times in D1 and 1 time in D3.
    Therefore,
    The term frequency tFi for "Lotus" in D1 is 1, D2 is 0 and D3 is 1.
    The document frequency dFi for "Lotus" is 2 (found in D1 and D3).
    Document frequency (D/dFi) for "Lotus" is 3/2=1.5
    Inverse document frequency (IDF=log(D/dFi)) for "Lotus" = 0.18
    IDF is a global value for each term.
  3. Now that we have the parameters for the formula, we can calculate the weight for term 't' in each documents 'd'.
    Wt,d = tFi * IDF
    So (Wt,d) for "Lotus" for D1 = 1 * 0.1761 = 0.1761
    (Wt,d) for "Lotus" for D2 = 0 * 0.1761 = 0
    Similarly, (Wt,d) for all documents and the query is found.
    (Wt,d) for "Lotus" in query Q = tFi * IDF = 1 * 0.1761 = 0.1761
    Pls note, if the term frequency (tFi) is 0, then the (Wt,d) is also 0.
    Thus we have weights (Wt,d) for Q, D1, D2 and D3.



  4. Now, we find out the vector lengths.



  5. Now, we calculate all dotproducts except all zero products




  6. Now, we calculate the similarity values



  7. In the end, we will sort on the cosine values in descending order and the document in the top will be the highest ranked. In this case, the rank will be

    Rank1 : D2 = 0.39
    Rank2 : D3 = 0.32
    Rank3 : D1 = 0.15
The above explained example is a simple example just to demonstrate the vector space model. In a real search engine, there are other factors which needs to be considered. The important ones are -
  • Stop words - While extracting all unique terms from the documents, the prepositions are removed. Here for eg - a, has, in, is, the, where. Also special characters like ',' '.' '-' etc can be handled.
  • Handle different forms of the word. As here a pure keyword search is done, trimming the term down to its root word and then matching it to the query will be a good idea. A good stemming algorithm will help. Stemming will help in matching all words like develop, developer, developers etc.
  • It is also important to know where within a document a term is found.
The vector space concept is widely used in search engines. MySQL is one good example which is inspired of this model and has implemented it in its fulltext engine.

In the MySQL implementation an application uses database tables consisting of N rows. Each row corresponds to a document, where

  1. Li, j = (log(dtf)+1)/sumdtf; i.e., local information based on logarithmic term counts
  2. Gi = log((N-nf)/nf); i.e., global information based on probabilistic IDF
  3. Nj = U/(1+0.0115*U); i.e., normalization based on pivoting

Here,

dtf = number of times the term appears in the document (row)
sumdtf = sum of (log(dtf)+1)'s for all terms in the same document (row)
U = number of unique terms in the document (row)
N = total number of document rows
nf = number of documents (rows) containing the term

The following equation is thus derived,
w = (log(dtf)+1)/sumdtf * U/(1+0.0115*U) * log((N-nf)/nf)

Well, now that we have understanding of this model, we can always tweak it as per our use and derive a new algorithm. Gonna be fun :-)

No comments: