A proposed query-sensitive similarity measure for information retrieval



Document clustering has been widely used in information retrieval systems to improve both the efficiency and the effectiveness of ranked-output systems, on the basis of the cluster hypothesis. According to this hypothesis, documents relevant to a query tend to be highly similar in the context defined by that query. A pair of documents therefore has an overall similarity (which ignores the query) and a specific similarity (the similarity of the pair given a query). A Query-Sensitive Similarity Measure (QSSM) is a mechanism for measuring the similarity of two documents given a query. In this paper, we first identify the sources of information that may be used for this purpose. Second, we propose a QSSM based on these information sources. Finally, we propose a parametric QSSM that uses both the product and the weighted sum to fuse the information from the identified sources. A genetic algorithm is used to learn the optimal parameter values of this measure for a specific collection, and the leave-one-out method is used to evaluate the proposed learning scheme. Our motivation is to determine whether the learning scheme performs significantly better than the measure proposed in the second step. Using several document collections, we evaluate the performance of each measure and compare the results with those of other QSSMs proposed in previous research.
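To make the distinction between overall and specific similarity concrete, the following is a minimal illustrative sketch and not the measure proposed in this paper. It assumes documents and queries are sparse term-weight dictionaries, takes the overall similarity to be the cosine between the two documents, and takes the specific similarity to be the cosine between the query and the terms the two documents share. The function names, the mixing parameter `lam`, and the particular way the product and the weighted sum are combined are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def common_terms(u, v):
    """Terms shared by both documents, weighted by their mean weight."""
    return {t: (u[t] + v[t]) / 2 for t in u if t in v}


def qssm(d1, d2, query, lam=0.5):
    """Hypothetical query-sensitive similarity (illustration only).

    overall  : similarity of the two documents, ignoring the query
    specific : similarity of their shared terms to the query
    lam      : assumed mixing weight between a product term and an
               equally weighted sum of the two components
    """
    overall = cosine(d1, d2)
    specific = cosine(common_terms(d1, d2), query)
    product = overall * specific
    weighted_sum = 0.5 * overall + 0.5 * specific
    return lam * product + (1.0 - lam) * weighted_sum


if __name__ == "__main__":
    doc_a = {"cluster": 1.0, "query": 1.0, "retrieval": 0.5}
    doc_b = {"cluster": 0.8, "query": 1.0, "genetic": 0.3}
    print(qssm(doc_a, doc_b, {"query": 1.0}))
    print(qssm(doc_a, doc_b, {"unrelated": 1.0}))
```

In this sketch the same pair of documents scores higher under a query that matches their shared terms than under an unrelated query, which is the behavior a QSSM is meant to capture. A parameter such as `lam` is the kind of value that a learning scheme could tune per collection.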