Mikhail Bessonov, Wilhelm-Schickard-Institute for Informatics [HREF1], Department for Computer Engineering (Prof. Dr. W. Rosenstiel) [HREF2], University of Tübingen [HREF3], Sand 13, Tübingen 72076, Germany
The Web of today offers thousands of smaller topic-oriented or regional search engines. The old ones continue to exist and new ones appear in spite of the progress achieved by their generic Web-wide competitors, because they produce better results in their areas of specialisation. However, finding and choosing the best specialised search engines for a particular information need is difficult. This paper suggests an approach to building statistical descriptions of topic-specific document collections that can be stored in and efficiently retrieved from a directory service like LDAP. An algorithmic framework is presented for parallel automatic propagation of queries to several most suitable collections. Two statistical models are compared experimentally within the framework. They differ in the amount of data needed for storage and in the resulting precision of collection selection. The conclusions are applicable to the problems of distributed information retrieval in general.
The problem of distributed search is no longer new in information retrieval, with some pioneering works in the area done in early 90's [10,4,3]. The basic use scenario remains the same (see figure 1): the user sends a query to the broker that propagates it (probably with some modifications) to the collection nodes, collates the search results from each node into a single ordered sequence and presents the latter to the user.
The time and resources required by each node for search are increasing with the number of results obtained from this node. For this reason, it is desirable to obtain no more hits than necessary from only the most suitable collection nodes. For example, if the user wants to see top 100 results, the number of hits obtained from the best 3 collections can be 70, 20, and 10. The are several possible ways of computing the number of hits to be requested from each collection:
We try to extend collection selection techniques to cover the case (3), i.e., efficient parallel querying of several stateless collection nodes. The case is important because it allows for efficient distributed search with very few requirements from the collection nodes. Even very simple subject-specific or regional search engines like HTML-producing CGI scripts can become parts of a larger distributed search system.
More formally, the broker receives a query
and the maximum allowed
number of results
from the user. The broker is aware of a set
of collection nodes
For each collection
the broker must compute the number of documents
to be retrieved from the collection (
means the cardinality
of the set
everywhere in the paper). Most of the elements
are normally zeros, which means that the query is not propagated to
the particular collection. If maximum efficiency in the resource usage
is desired, the condition
must hold. The algorithms
that compute the number of hits to request from each collection node
do more than just collection selection, so they will be called resource
allocation algorithms throughout this paper.
Every collection is required to compute and store in the directory server certain information about the search term distribution in the collection. The exact information stored depends on the statistical model of the collection used; two candidate models are discussed in section 3. It must be however indexed by terms (or other features allowed to appear in the query) present in the documents to allow for fast lookups. This statistical information should be computed during updates of the document collections and posted periodically to the directory service, an approach similar to the one taken in [2].
We intend to perform parallel search in stateless heterogeneous collections,
as described in the scenario (3) in section 1.
At the search time, the stored statistical information for each term
and each collection is obtained from the directory service. The steps
we take to construct the resource allocation vector
are as follows.
A document similarity function
is
chosen (e.g., the classical cosine measure as defined in the vector
space model). The value of the similarity function for a given query
and document is often called document score. The documents
with higher scores are more likely to be relevant to the query. We
treat the chances of being relevant as a conditional probability depending
on the document score. A set of training queries with expert-provided
correct answers can be used to model the probability of relevance
as a function of document score. Many statistical models can be used
to this end. We have chosen the simplest suitable, logistic regression
(see, e.g., [12,13]). According to
the model, the log-odds of relevance are a linear function of the
document score:
A statistical model of term weight distribution among the documents
in the collections is chosen. There are multiple candidates for the
role, having their own drawbacks and advantages. This work considers
two groups of such candidates in sections 3.1 and 3.2.
The models allow for computation of the distribution of document scores
given the query vector
For every collection
we
obtain a probability density function
where
is document score. The collection-dependent parameters
for this function are obtained from the directory service by lookups
that use search terms from the query
Combining the estimated probability density function for scores and
the probability of relevance provided by equation 2,
one can compute the probability of the following event: a randomly
picked document from the collection
is relevant to the query
Binomial distributions can be well approximated by Gaussian distributions
for argument values located near the mean. We need such an approximation
in order to compute the optimal allocation vector
efficiently.
Thus we treat the number
of relevant documents in
collection
as a random value distributed normally with mean
and variance
An approximate solution is, however, trivial to compute. A continuous
approximation can be found by dropping the constraint that
is integer, solving the corresponding convex optimisation problem
and rounding the results. We maximise the logarithm of the product
since sums are easier to deal with. The optimisation problem acquires
the following form:
The approach described in the current section will be used in sections 3.1 and 3.2. Two families of term weight distributions are investigated in these sections, producing two slightly different algorithms for collection selection. The performance of these algorithms is compared and analysed in section 5.
The definition of the probability density function for document scores
was deliberately omitted from section 2. Any number
of such models can be built, we consider two significantly different
ones.
The first model concentrates on the contribution of each term to the overall document score, without storing any information about co-occurence of the terms in the texts. The second model stores less about individual terms, but more about their co-occurence.
The vGlOSS model is an extension of the one proposed in [9].
The original model was designed only to rank collections, but we need
also to estimate how many results
to request from each collection
given the maxim number
of results the user wants to see. Nothing
is changed in the underlying assumptions, we just continue the calculations
presented in the original work. The advantages of the model are its
simplicity and a relatively small amount of information that has to
be stored in the directory service.
Only two values, an integer and a float, are computed and stored for
every term
from each collection
These are the
document frequency
i.e., the number of documents
in the collection
containing at least one occurrence of
the term
and the cumulative term frequency
the sum of the term weights of the term
over all documents
of the collection
According to the maximum correlation
model of vGlOSS, the following assumptions are made about each of
the collections:
After substituting these results into equation 4,
we can for each collection
estimate the mean
and variance
using the formulae
To achieve higher precision in prediction of the number of documents in a collection relevant to a query, one need to have more information about co-occurrence of terms in documents. Keeping a full correlation matrix is infeasible because its size will be the square of the vocabulary size of the collection. A more compact representation is needed, and Latent Semantic Indexing (LSI) can be used to obtain one.
LSI was first introduced in 1990 (see [5]) as an empirical technique. Its foundations lay in linear algebra, in particular in so called Singular Value Decomposition (SVD). Its behaviour is similar in many respects to the Principal Component Analysis (PCA), but LSI can be computed for sparse data ways more efficiently. A solid statistical interpretation was given to LSI in [6], but the resulting formulae seem to be too complex to allow for practical computation. Our statistical model also uses LSI, but is much simpler.
The main LSI formula can be represented for our purposes as
We model the distribution of
as normal multivariate with
a diagonal covariance matrix. The rationale for this is that (orthogonal)
singular vectors
tend to have correlation close to zero
(it did not exceed 0.1 in our experiments).
The distribution mean
can be estimated with the sample
mean:
The directory service must contain
(the rank of LSI) elements
of matrix
per every term (or generally feature) present in the
collection. It must also contain several items of relatively small
size that are not term-related, like the distribution mean
We need some formal criteria to perform experimental evaluation and
comparison of resource allocation algorithms. Following the approach
taken in ([9]), we define a function
that shows how useful is the document
with respect to the query
(the function was applied to collections rather than to individual
documents in the original Gravano's work). There is more than one
way to define the function, leading to different criteria; one of
them is discussed below. Given the merit function, one can define
the total merit
of best
documents in
all available collections w.r.t. the query
simply as a sum of
top
merits of documents taken from the union of all available
collections:
Similarly, one can define the total merit
of best
documents present in the collection
as a sum
of top
merits of document taken only from this collection:
A recall-like measure can be defined as a share of the total merit
available in all collections for
documents, obtained through
the allocation vector
![]()
Different definitions of
lead
to different recall values. The most natural definition is that the
function takes the value of 1 if the document
is relevant to
the query
and 0 otherwise. This corresponds to the relevance
based ranking (RBR) baseline defined in [8]. The
value of the function defined in equation 13 becomes
in this case equal to the recall (in the traditional IR sense) of
a system where each collection node searches perfectly and
top results are obtained from each collection node
By
a perfect node search we mean that if the collection
contains
documents relevant to the query, it returns
relevant documents when
results are requested. The value
of the relevance-based recall measure is that it reflects well the
prediction quality of the number of relevant documents in collections.
Equation 13 defines the recall value for a single query
and for a user-specified maximum number of results
Testing
is normally done with multiple queries, so the recall values get averaged
over all queries for every value of
and presented in a graphical
form.
Old news feeds from the US Federal Broadcast Information Service (FBIS) were used as the source of text documents. The FBIS-3 corpus was split into 20 collections according to the files containing the documents: 12 files with consequent numbers form each collection. The collections contain no duplicates, i.e. no document appears in more than one collection. The total number of documents in all collections was 61578. The numbers of documents in each collection fluctuate between 1185 and 4287. A set of 50 queries was used, the ones from the TREC-6 ad-hoc runs. They are listed in TREC sources as queries 301 to 350.
The performance of both statistical models described above was tested,
with the maximum number of results
requested by the user varying
between 10 and 1000. Recall as defined in equation 13
and discussed in section 4 was calculated for the
recall-based ranking baseline. The LSI rank used was 128. The Recall
graphs are shown in
figure 2.
A method for effective and efficient propagation of queries to collection nodes in a distributed system was presented. It employs a directory service like LDAP for storing collection descriptions that allows for independent updates of these by the collection nodes themselves. The algorithm for collection selection is generic enough to allow for multiple statistical models and can be used in various settings, not only Web search. Two statistical models for collection description were presented and compared. The LSI-based model, in spite of significantly higher storage requirements, have shown only moderate improvements in comparison to a simpler vGlOSS-derived model.
[1] Christoph Baumgarten. "A probabilitstic solution to the selection and fusion problem in distributed information retrieval." In Proc. of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 246-253, Berkeley, CA, USA, August 1999. ACM Press.
[2] M. Bessonov, U. Heuser, I. Nekrestyanov, and A. Patel. "Open architecture for distributed search systems.&qout; Lecture Notes in Computer Science, 1597:55-69, 1999.
[3] D. Boles, M. Dreger, and K. Großjohann. "MeDoC information broker. Harnessing the information in literature and full text databases." In Proc. of the ACM SIGIR'96 Workshop on Networked Information Retrieval, Zurich, Switzerland, August 1996. ACM Press.
[4] James P. Callan, Zhihong Lu, and W. Bruce Croft. "Searching distributed collections with inference networks." In E. A. Fox, P. Ingwersen, and R. Fidel, editors, Proc. of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21-28, Seattle, WA, USA, July 1995. ACM Press.
[5] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. "Indexing by latent semantic analysis." Journal of the American Society for Information Science, 41(6):391-407, 1990.
[6] Chris H. Q. Ding. "A similarity-based probability model for latent semantic indexing." In Proc. of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 58-65, Berkeley, CA, USA, August 1999. ACM Press.
[7] James C. French, Allison L. Powell, and Jamie Callan. "Effective and efficient automatic database selection." Technical Report CS-99-08, Department of Computer Science, University of Virginia, February 22 1999. Available online at HREF4.
[8] James C. French, Allison L. Powell, Charles L. Viles, Travis Emmitt, and Kevin J. Prey. "Evaluating database selection techniques: a testbed and experiment." In Proc. of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 121-129, Melbourne, Australia, August 1998. ACM Press.
[9] Luis Gravano and Héctor García-Molina. "Generalizing GlOSS to vector-space databases and broker hierarchies." In Umeshwar Dayal, Peter M. D. Gray, and Shojiro Nishio, editors, Proc. of the 21st International Conference on Very Large Data Bases (VLDB'95), pages 78-89, Zurich, Switzerland, 11-15 September 1995. Morgan Kaufmann.
[10] Luis Gravano, Héctor García-Molina, and Anthony Tomasic. "The effectiveness of GlOSS for the text-database discovery problem." In Proc. of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, MN, USA, May 24-27 1994.
[11] Luis Gravano, Héctor García-Molina, and Anthony Tomasic. "GlOSS: Text-source discovery over the Internet." ACM Transactions on Database Systems, 24(2):229-264, June 1999.
[12] Ray R Larson. "A logistic regression approach to distributed IR." In Proc. of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 399-400, Tampere, Finland, 2002. ACM Press.
[13] E. L. Lehmann and G. Gasella. Theory of Point Estimation. Springer Texts in Statistics. Springer-Verlag, second edition, 1998.
[14] National Information Standards Organization. Information Retrieval (Z39.50): Application Service Definition and Protocol Specification (ANSI/NISO Z39.50. I995). NISO Press, Betheda, MD, USA, 1995. Available online at HREF5.