Distributed search using subject-specific Web search engines

Mikhail Bessonov, Wilhelm-Schickard-Institute for Informatics [HREF1], Department for Computer Engineering (Prof. Dr. W. Rosenstiel) [HREF2], University of Tübingen [HREF3], Sand 13, Tübingen 72076, Germany

Abstract

The Web of today offers thousands of smaller topic-oriented or regional search engines. The old ones continue to exist and new ones appear in spite of the progress achieved by their generic Web-wide competitors, because they produce better results in their areas of specialisation. However, finding and choosing the best specialised search engines for a particular information need is difficult. This paper suggests an approach to building statistical descriptions of topic-specific document collections that can be stored in and efficiently retrieved from a directory service like LDAP. An algorithmic framework is presented for parallel automatic propagation of queries to several most suitable collections. Two statistical models are compared experimentally within the framework. They differ in the amount of data needed for storage and in the resulting precision of collection selection. The conclusions are applicable to the problems of distributed information retrieval in general.

1 Introduction and related works

The problem of distributed search is no longer new in information retrieval, with some pioneering works in the area done in early 90's [10,4,3]. The basic use scenario remains the same (see figure 1): the user sends a query to the broker that propagates it (probably with some modifications) to the collection nodes, collates the search results from each node into a single ordered sequence and presents the latter to the user.

\includegraphics[%%
width=0.50\textwidth,
keepaspectratio]{/home/mikeb/doc/papers/mar2004/dd-strategy.eps}
Figure 1: Distributed search scenario

The time and resources required by each node for search are increasing with the number of results obtained from this node. For this reason, it is desirable to obtain no more hits than necessary from only the most suitable collection nodes. For example, if the user wants to see top 100 results, the number of hits obtained from the best 3 collections can be 70, 20, and 10. The are several possible ways of computing the number of hits to be requested from each collection:

  1. Some collections can return meta information about the search results (like the total number of hits) fast, without doing a full search. The search strategy in this case may include ranking of the collections with respect to the query, concurrently sending the query to several top ranked collections, obtaining meta information about the search results, and computing the needed number of results from each collection. In this case, it is sufficient just to order the candidate collections by the estimated number of documents they contain. There is no need in estimating how many documents to request from each collection. The collections that support such scenario are quite complex and implement stateful protocols like Z39.50 ([14]).
  2. Other collections are stateless and can just return search results, up to the maximum number specified in the search request. This is the usual way how search is implemented over HTTP. In the simplest case, one can also get away by only ranking the collections and requesting the same number of documents (e.g., the maximum number requested by the user) from each of the top ranked collections. This is however wasteful in terms of query processing time and resources, because several times more hits than strictly necessary are requested from each collection node to which the query was propagated.
  3. Finally, one can estimate the actual number of relevant documents stored at each collection node, and request an individual number of hits from each node based on this estimate. This scenario allows for a rather simple stateless collection node, but requires most intelligence from the broker. The current paper focuses on this particular scenario.
Most work in distributed information retrieval has concentrated on just ranking the collections by relevance to a particular query, without computing the number of documents to be requested from each collection ([1,11,4,7]). A number of metrics somewhat similar to recall and precision in a single collection search was developed for collection ranks ([8]). They are suitable for the scenarios (1) and (2) from the list above, but are not designed to be used in the scenario (3).

We try to extend collection selection techniques to cover the case (3), i.e., efficient parallel querying of several stateless collection nodes. The case is important because it allows for efficient distributed search with very few requirements from the collection nodes. Even very simple subject-specific or regional search engines like HTML-producing CGI scripts can become parts of a larger distributed search system.

More formally, the broker receives a query $q$ and the maximum allowed number of results $N$ from the user. The broker is aware of a set of collection nodes  $C=\left\{ c_{i}\right\} \,.$ For each collection $c_{i}$ the broker must compute the number of documents  $p_{i},\, i\in1\ldots\left\vert C\right\vert$ to be retrieved from the collection ( $\left\vert C\right\vert$ means the cardinality of the set $C$ everywhere in the paper). Most of the elements $p_{i}$ are normally zeros, which means that the query is not propagated to the particular collection. If maximum efficiency in the resource usage is desired, the condition  $\sum_{i}p_{i}\leq N$ must hold. The algorithms that compute the number of hits to request from each collection node do more than just collection selection, so they will be called resource allocation algorithms throughout this paper.

2 Approach

Every collection is required to compute and store in the directory server certain information about the search term distribution in the collection. The exact information stored depends on the statistical model of the collection used; two candidate models are discussed in section 3. It must be however indexed by terms (or other features allowed to appear in the query) present in the documents to allow for fast lookups. This statistical information should be computed during updates of the document collections and posted periodically to the directory service, an approach similar to the one taken in [2].

We intend to perform parallel search in stateless heterogeneous collections, as described in the scenario (3) in section 1. At the search time, the stored statistical information for each term and each collection is obtained from the directory service. The steps we take to construct the resource allocation vector $p$ are as follows.

A document similarity function $\sigma=\mathtt{sim}\,(q,\, d)$ is chosen (e.g., the classical cosine measure as defined in the vector space model). The value of the similarity function for a given query and document is often called document score. The documents with higher scores are more likely to be relevant to the query. We treat the chances of being relevant as a conditional probability depending on the document score. A set of training queries with expert-provided correct answers can be used to model the probability of relevance as a function of document score. Many statistical models can be used to this end. We have chosen the simplest suitable, logistic regression (see, e.g., [12,13]). According to the model, the log-odds of relevance are a linear function of the document score:

\begin{displaymath}
\ln\frac{1-P\left\{ \mathtt{relevant}\left\vert\sigma\right....
...levant}\left\vert\sigma\right.\right\} }=\theta-\beta\sigma\;,
\end{displaymath} (1)

where $\sigma$ is a document score, $P\left\{ \mathtt{relevant}\left\vert\sigma\right.\right\} $ is the probability of document relevance to the query given that the document score is equal to $\sigma$, and the constants $\theta$ and $\beta$ are estimated from the training data using the generalised least squares method. The training data are gathered and processed independently for each collection, so no communication between the collections is necessary in order to compute the constants. The probability of relevance can be expressed explicitly as
\begin{displaymath}
P\left\{ \mathtt{relevant}\left\vert\sigma\right.\right\} =\frac{1}{1+\exp\left(\theta-\beta\sigma\right)}\;.
\end{displaymath} (2)

A statistical model of term weight distribution among the documents in the collections is chosen. There are multiple candidates for the role, having their own drawbacks and advantages. This work considers two groups of such candidates in sections 3.1 and 3.2. The models allow for computation of the distribution of document scores given the query vector $q\,.$ For every collection $c_{i}$ we obtain a probability density function $\mathtt{scorepdf}\left(\sigma\left\vert q,\, c_{i}\right.\right)\;,$ where $\sigma$ is document score. The collection-dependent parameters for this function are obtained from the directory service by lookups that use search terms from the query $q\,.$

Combining the estimated probability density function for scores and the probability of relevance provided by equation 2, one can compute the probability of the following event: a randomly picked document from the collection $c_{i}$ is relevant to the query $q\,:$

\begin{displaymath}
P\left\{ \mathtt{relevant}\left\vert q,\, c_{i}\right.\right...
...ight.\right)}{1+\exp\left(\theta-\beta\sigma\right)}d\sigma\;.
\end{displaymath} (3)

Similarly, one can compute the expectation $\mu_{i}$ and variance $\xi_{i}^{2}$ for the number of relevant documents in the collection $c_{i}\,:$
\begin{displaymath}
\begin{array}{l}
\mu_{i}=\left\vert c_{i}\right\vert\int_{-\...
...xp\left(\theta-\beta\sigma\right)\right)^{2}}d\sigma\end{array}\end{displaymath} (4)

The formula above assumes that relevance or irrelevance of each document is independent of relevance or irrelevance of other documents to the query. Thus statistics for binomial distribution can be employed.

Binomial distributions can be well approximated by Gaussian distributions for argument values located near the mean. We need such an approximation in order to compute the optimal allocation vector $p$ efficiently. Thus we treat the number $\mathtt{nrel}$ of relevant documents in collection $c_{i}$ as a random value distributed normally with mean $\mu_{i}$ and variance  $\xi_{i}^{2}\,:$

\begin{displaymath}
P\left\{ \mathtt{nrel}\geq k\left\vert q,\, c_{i}\right.\rig...
...}\exp\left(-\frac{\left(x-\mu_{i}\right)^{2}}{2\xi_{i}}\right)
\end{displaymath} (5)

An optimal allocation vector $p$ maximises the probability that all the requested documents are relevant, i.e., maximises $\prod_{i}P\left\{ \mathtt{nrel}\geq p_{i}\left\vert q,\, c_{i}\right.\right\} \:,$ while satisfying the constraints $p_{i}\geq0\,,$ $p_{i}$ is integer, $\sum_{i}p_{i}=N\,.$ [*] This problem allows for a direct numeric solution, but the latter can be too computationally expensive when the number of collections is very large.

An approximate solution is, however, trivial to compute. A continuous approximation can be found by dropping the constraint that $p_{i}$ is integer, solving the corresponding convex optimisation problem and rounding the results. We maximise the logarithm of the product since sums are easier to deal with. The optimisation problem acquires the following form:

\begin{displaymath}
\begin{array}{l}
\sum_{i=1}^{I}\ln\left(1-\int_{-\infty}^{\w...
...1}^{I}\widetilde{p_{i}}=N\\
\widetilde{p_{i}}\in\Re\end{array}\end{displaymath} (6)

It can be demonstrated by using the Lagrange multipliers that when $N\geq\sum_{i=1}^{I}\mu_{i}$ holds, the optimal solution for the continuous problem is
\begin{displaymath}
\widetilde{p_{i}}=\mu_{i}+\xi_{i}\frac{N-\sum_{j=1}^{I}\mu_{j}}{\sum_{j=1}^{I}\xi_{j}}\;,\: i=1,\,\ldots,\, I
\end{displaymath} (7)

For smaller values of $N$ some of the  $\widetilde{p_{i}}$ values may become negative. The solution for this case can be obtained by setting such  $\widetilde{p_{i}}$ to 0 and excluding the corresponding  $\mu_{i}\,,\:\xi_{i}$ from the sums in equation 7. After the exclusion $\widetilde{p_{i}}\geq0$ holds for every $i$ and $\sum_{i=1}^{I}\widetilde{p_{i}}=N\,.$ Rounding must be done with care in order to keep the sum of $p_{i}$ equal to $N\,.$ This can be achieved by sorting  $\widetilde{p_{i}}$ in the descending order and assigning $p_{i}=\mathtt{ceil}\left(\widetilde{p_{i}}\right)$ while $\sum_{i=1}^{I}p_{i}\leq N\,.$ The rest of $p_{i}$ are zeros.

The approach described in the current section will be used in sections 3.1 and 3.2. Two families of term weight distributions are investigated in these sections, producing two slightly different algorithms for collection selection. The performance of these algorithms is compared and analysed in section 5.


3 Statistical models

The definition of the probability density function for document scores  $\mathtt{scorepdf}\left(\sigma\left\vert q,\, c_{i}\right.\right)$ was deliberately omitted from section 2. Any number of such models can be built, we consider two significantly different ones.

The first model concentrates on the contribution of each term to the overall document score, without storing any information about co-occurence of the terms in the texts. The second model stores less about individual terms, but more about their co-occurence.

3.1 vGlOSS

The vGlOSS model is an extension of the one proposed in [9]. The original model was designed only to rank collections, but we need also to estimate how many results $p_{i}$ to request from each collection $c_{i}$ given the maxim number $N$ of results the user wants to see. Nothing is changed in the underlying assumptions, we just continue the calculations presented in the original work. The advantages of the model are its simplicity and a relatively small amount of information that has to be stored in the directory service.

Only two values, an integer and a float, are computed and stored for every term $t_{j}$ from each collection $c_{i}\,.$ These are the document frequency  $\mathtt{df}_{ij}\,,$ i.e., the number of documents in the collection $c_{i}$ containing at least one occurrence of the term $c_{i}\,,$ and the cumulative term frequency $w_{ij}\,,$ the sum of the term weights of the term $t_{j}$ over all documents of the collection $c_{i}\,.$ According to the maximum correlation model of vGlOSS, the following assumptions are made about each of the collections:

  1. If query terms $t_{1}$ and $t_{2}$ appear in $\mathtt{df}_{i1}$ and $\mathtt{df}_{i2}$ documents in collection $c_{i}\,,$ respectively, and $\mathtt{df}_{i1}\leq\mathtt{df}_{i2}\,,$ then every document $d\in c_{i}$ that contains $t_{1}$ also contains $t_{2}\,.$
  2. The weight of a term is distributed uniformly over all documents that contain the term.
The assumptions are clearly unrealistic, but have performed rather well in the experiments (see [9]). One possible explanation is that the word combinations occuring in the user queries are often typical for the subject area of the query, so the same combinations occur in the documents very frequently too. Suppose terms  $t_{1},\,\ldots,\, t_{S}$ are used in the query $q\,,$ with corresponding $\mathtt{df}_{i1}\leq\mathtt{df}_{i2}\leq\ldots\leq\mathtt{df}_{iS}\,.$ Assume that $\mathtt{df}_{i1}>0\,.$ From assumption 1, the  $\mathtt{df}_{i1}$ documents in collection $c_{i}$ that contain term $t_{1}$ also contain all the remaining $S-1$ query terms. From assumption 2, the similarity of any of these  $\mathtt{df}_{i1}$ documents to the query $q$ is  $\sigma_{i1}=\sum_{s=1}^{S}q_{s}\frac{w_{is}}{\mathtt{df}_{is}}\,,$ where $q_{s}$is the weight of term $t_{s}$ in the query $q\,.$ Furthermore, these  $\mathtt{df}_{i1}$ documents have the highest similarity to $q$ among the documents in $C_{i}\,.$ According to assumption 1, the next $\mathtt{df}_{i2}-\mathtt{df}_{i1}$ documents contain terms $t_{2},\,\ldots,\, t_{S}$ but not $t_{1}\,.$ Thus, it is possible to calculate $\sigma_{i2},\,\ldots,\,\sigma_{iS}$ according to the formula  $\sigma_{ij}=\sum_{s=j}^{S}q_{s}\frac{w_{is}}{\mathtt{df}_{is}}\,.$ Obviously, $\sigma_{i1}>\sigma_{i2}>\ldots>\sigma_{iS}\,.$ Thus, collection $c_{i}$ contains $n_{i1}=\mathtt{df}_{i1}$ documents with score $\sigma_{i1}\,,$ $n_{i2}=\mathtt{df}_{i2}-\mathtt{df}_{i1}$ documents with score $\sigma_{i2}\,,$ and so on until $n_{iS}=\mathtt{df}_{iS}-\mathtt{df}_{iS-1}$ documents with score $\sigma_{iS}\,.$ The remaining documents are assumed to have the score of 0.

After substituting these results into equation 4, we can for each collection $c_{i}$ estimate the mean $\mu_{i}$ and variance $\xi_{i}^{2}$ using the formulae

\begin{displaymath}
\begin{array}{l}
\mu_{i}=\sum_{s=1}^{S}\frac{n_{is}}{1+\exp\...
...\left(\theta-\beta\sigma_{is}\right)\right)^{2}}\end{array}\;.
\end{displaymath} (8)

Further computation is done starting from the equation 5 as described in section 2.

3.2 LSI

To achieve higher precision in prediction of the number of documents in a collection relevant to a query, one need to have more information about co-occurrence of terms in documents. Keeping a full correlation matrix is infeasible because its size will be the square of the vocabulary size of the collection. A more compact representation is needed, and Latent Semantic Indexing (LSI) can be used to obtain one.

LSI was first introduced in 1990 (see [5]) as an empirical technique. Its foundations lay in linear algebra, in particular in so called Singular Value Decomposition (SVD). Its behaviour is similar in many respects to the Principal Component Analysis (PCA), but LSI can be computed for sparse data ways more efficiently. A solid statistical interpretation was given to LSI in [6], but the resulting formulae seem to be too complex to allow for practical computation. Our statistical model also uses LSI, but is much simpler.

The main LSI formula can be represented for our purposes as

\begin{displaymath}
A\approx\left(u_{1},\, u_{2},\,\ldots,\, u_{r}\right)\left(\...
...ray}{c}
v_{1}\\
v_{2}\\
\vdots\\
v_{r}\end{array}\right)\,.
\end{displaymath} (9)

Here $A$ is a term-document matrix for a particular collection, each column of it represents a document using the vector space model, the elements are term weights within the documents. $r<\mathrm{rank}\left(A\right)$ is the rank to which the SVD is truncated, $u_{1},\, u_{2},\,\ldots,\, u_{r}$ and $v_{1},\, v_{2},\,\ldots,\, v_{r}$ are the left and the right singular vectors respectively. $u_{i}$ is a column vector with the number of elements equal to the vocabulary size of the collection. $v_{i}$ is a row vector having the same number of elements as the number of documents in the collection. Define column vectors $x_{k}=\left(x_{k1},\, x_{k2},\,\ldots,\, x_{kr}\right)^{\top}=\left(v_{1k},\, v...
...dots,\, v_{rk}\right)^{\top}\,,\: k=1,\,\ldots,\,\left\vert c_{i}\right\vert\,.$ Each $x_{k}$ represents a document $d_{ik}$ from the collection $c_{i}$ in the low-dimensional space. Every component $x_{ki}$ of the vector $x_{k}$ contains information about a mixture of terms. Every term can contribute to more than one component of the vector. The terms that frequently co-occur in the text are represented by the same components. This has brings both advantages and drawbacks. On one hand, synonyms and ohter related words tend to into the same component, thus a collection having many related words to the ones used in the query is likely to be allocated a larger number of results by the resource allocation algorithm. On the other, the contribution of exactly matching words becomes lower and collections having a small number of very relevant documents can be allocated too few results.

We model the distribution of $x_{k}$ as normal multivariate with a diagonal covariance matrix. The rationale for this is that (orthogonal) singular vectors $v_{i}$ tend to have correlation close to zero (it did not exceed 0.1 in our experiments).

The distribution mean $\overline{x}$ can be estimated with the sample mean:

\begin{displaymath}
\overline{x}=\frac{1}{\left\vert c_{i}\right\vert}\sum_{k=1}^{\left\vert c_{i}\right\vert}x_{k}\,.
\end{displaymath} (10)

The variance $\xi_{n}^{2},\: n=1,\,\ldots,\, r$ of each component of vectors $x_{k}$ can also be estimated as sample variance:
\begin{displaymath}
\xi_{n}^{2}=\frac{1}{\left\vert c_{i}\right\vert-1}\sum_{k=1...
...t c_{i}\right\vert}\left(x_{kn}-\overline{x}_{n}\right)^{2}\,.
\end{displaymath} (11)

Define a diagonal covariance matrix $\Xi\,\left(r\times r\right)$ as $\Xi=\mathrm{diag}\left(\xi_{1}^{2},\,\xi_{2}^{2},\,\ldots,\,\xi_{r}^{2}\right)\,.$ After some calculations using the linear properties of the normal distribution we are finally able to compute the probability density function:


\begin{displaymath}
\begin{array}{rl}
\overline{\alpha}= & q^{\top}U\Sigma\overl...
...{\alpha-\overline{\alpha}}{\xi}\right)^{2}\right]\,.\end{array}\end{displaymath} (12)

Here  $U=\left(u_{1},\, u_{2},\,\ldots,\, u_{r}\right)$and  $\Sigma=\mathrm{diag}\left(\sigma_{1},\,\sigma_{2},\,\ldots,\,\sigma_{r}\right)\,.$

The directory service must contain $r$ (the rank of LSI) elements of matrix $U$ per every term (or generally feature) present in the collection. It must also contain several items of relatively small size that are not term-related, like the distribution mean $x\,.$

4 Criteria

We need some formal criteria to perform experimental evaluation and comparison of resource allocation algorithms. Following the approach taken in ([9]), we define a function $\mathtt{merit}\left(q,\, d\right)$ that shows how useful is the document $d$ with respect to the query $q$ (the function was applied to collections rather than to individual documents in the original Gravano's work). There is more than one way to define the function, leading to different criteria; one of them is discussed below. Given the merit function, one can define the total merit  $M\left(q,\, n\right)$ of best $n$ documents in all available collections w.r.t. the query $q$ simply as a sum of top $n$ merits of documents taken from the union of all available collections:  $M\left(q,\, n\right)=\max_{\left.D\subset\left(\bigcup c_{i}\right)\,\right\ver...
...right\vert\leq n}\left(\sum_{d\in D}\mathtt{merit}\left(q,\, d\right)\right)\,.$ Similarly, one can define the total merit  $M_{i}\left(q,\, n\right)$ of best $n$ documents present in the collection $c_{i}$as a sum of top $n$ merits of document taken only from this collection: $M_{i}\left(q,\, n\right)=\max_{\left.D\subset c_{i}\,\right\vert\,\left\vert D\right\vert\leq n}\left(\sum_{d\in D}\mathtt{merit}\left(q,\, d\right)\right)\,.$ A recall-like measure can be defined as a share of the total merit available in all collections for $N$ documents, obtained through the allocation vector  $p=\left\{ p_{i}\right\} \,:$

\begin{displaymath}
\mathtt{Rec}\left(p,\, q,\, N\right)=\frac{\sum_{i=1}^{\left...
...ht\vert}M_{i}\left(q,\, p_{i}\right)}{M\left(q,\, N\right)}\,.
\end{displaymath} (13)

For any allocation vector $p$ that requests no more documents from the collection nodes than the maximum number of results allowed by the user (i.e., where $\sum_{i=1}^{I}p_{i}\leq N$) the value of the recall measure $\mathtt{Rec}()$ lies between 0 and 1. For any value of $N>0$ there always exist at least one optimal allocation vector $r$ with the property $\mathtt{Rec}\left(r,\, q,\, N\right)=1\,,$ i.e., discovering all the merit available in any $N$ documents from the union of collections.

Different definitions of $\mathtt{merit}\left(q,\, d\right)$ lead to different recall values. The most natural definition is that the function takes the value of 1 if the document $d$ is relevant to the query $q\,,$ and 0 otherwise. This corresponds to the relevance based ranking (RBR) baseline defined in [8]. The value of the function defined in equation 13 becomes in this case equal to the recall (in the traditional IR sense) of a system where each collection node searches perfectly and $p_{i}$ top results are obtained from each collection node $c_{i}\,.$ By a perfect node search we mean that if the collection $c_{i}$ contains $x$ documents relevant to the query, it returns  $\min\left\{ x,\, p_{i}\right\} $ relevant documents when $p_{i}$ results are requested. The value of the relevance-based recall measure is that it reflects well the prediction quality of the number of relevant documents in collections.

Equation 13 defines the recall value for a single query $q$ and for a user-specified maximum number of results $N\,.$ Testing is normally done with multiple queries, so the recall values get averaged over all queries for every value of $N$ and presented in a graphical form.

5 Experimental results

Old news feeds from the US Federal Broadcast Information Service (FBIS) were used as the source of text documents. The FBIS-3 corpus was split into 20 collections according to the files containing the documents: 12 files with consequent numbers form each collection. The collections contain no duplicates, i.e. no document appears in more than one collection. The total number of documents in all collections was 61578. The numbers of documents in each collection fluctuate between 1185 and 4287. A set of 50 queries was used, the ones from the TREC-6 ad-hoc runs. They are listed in TREC sources as queries 301 to 350.

The performance of both statistical models described above was tested, with the maximum number of results $N$ requested by the user varying between 10 and 1000. Recall as defined in equation 13 and discussed in section 4 was calculated for the recall-based ranking baseline. The LSI rank used was 128. The Recall graphs are shown in figure 2.

\includegraphics[%%
width=0.75\columnwidth]{all-vs-rbr.eps}

Figure 2: Recall $\mathtt{Rec}\left(p,\, q,\, N\right)$ averaged over all queries
The "SBR allocation" curve is given for reference only: it displays the performance of a trivial algorithm, Size Based Ranking, that requests from each collection the number of results proportional to the total number of documents in the collection (independent of the query). The "Theoretic LSI maximum" curve is an allocation induced by actual LSI scores for the same rank (128), without any statistical approximations. The values of $N$ in the vicinity of 100 are of the highest interest, because these are the values most likely to be requested by a user, and also because this is the mean number of documents relevant to a query in the union of all collections. All the considered allocation methods have a distinct performance minimum in this region.

6 Conclusion

A method for effective and efficient propagation of queries to collection nodes in a distributed system was presented. It employs a directory service like LDAP for storing collection descriptions that allows for independent updates of these by the collection nodes themselves. The algorithm for collection selection is generic enough to allow for multiple statistical models and can be used in various settings, not only Web search. Two statistical models for collection description were presented and compared. The LSI-based model, in spite of significantly higher storage requirements, have shown only moderate improvements in comparison to a simpler vGlOSS-derived model.

References

[1] Christoph Baumgarten. "A probabilitstic solution to the selection and fusion problem in distributed information retrieval." In Proc. of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 246-253, Berkeley, CA, USA, August 1999. ACM Press.

[2] M. Bessonov, U. Heuser, I. Nekrestyanov, and A. Patel. "Open architecture for distributed search systems.&qout; Lecture Notes in Computer Science, 1597:55-69, 1999.

[3] D. Boles, M. Dreger, and K. Großjohann. "MeDoC information broker. Harnessing the information in literature and full text databases." In Proc. of the ACM SIGIR'96 Workshop on Networked Information Retrieval, Zurich, Switzerland, August 1996. ACM Press.

[4] James P. Callan, Zhihong Lu, and W. Bruce Croft. "Searching distributed collections with inference networks." In E. A. Fox, P. Ingwersen, and R. Fidel, editors, Proc. of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21-28, Seattle, WA, USA, July 1995. ACM Press.

[5] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. "Indexing by latent semantic analysis." Journal of the American Society for Information Science, 41(6):391-407, 1990.

[6] Chris H. Q. Ding. "A similarity-based probability model for latent semantic indexing." In Proc. of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 58-65, Berkeley, CA, USA, August 1999. ACM Press.

[7] James C. French, Allison L. Powell, and Jamie Callan. "Effective and efficient automatic database selection." Technical Report CS-99-08, Department of Computer Science, University of Virginia, February 22 1999. Available online at HREF4.

[8] James C. French, Allison L. Powell, Charles L. Viles, Travis Emmitt, and Kevin J. Prey. "Evaluating database selection techniques: a testbed and experiment." In Proc. of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 121-129, Melbourne, Australia, August 1998. ACM Press.

[9] Luis Gravano and Héctor García-Molina. "Generalizing GlOSS to vector-space databases and broker hierarchies." In Umeshwar Dayal, Peter M. D. Gray, and Shojiro Nishio, editors, Proc. of the 21st International Conference on Very Large Data Bases (VLDB'95), pages 78-89, Zurich, Switzerland, 11-15 September 1995. Morgan Kaufmann.

[10] Luis Gravano, Héctor García-Molina, and Anthony Tomasic. "The effectiveness of GlOSS for the text-database discovery problem." In Proc. of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, MN, USA, May 24-27 1994.

[11] Luis Gravano, Héctor García-Molina, and Anthony Tomasic. "GlOSS: Text-source discovery over the Internet." ACM Transactions on Database Systems, 24(2):229-264, June 1999.

[12] Ray R Larson. "A logistic regression approach to distributed IR." In Proc. of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 399-400, Tampere, Finland, 2002. ACM Press.

[13] E. L. Lehmann and G. Gasella. Theory of Point Estimation. Springer Texts in Statistics. Springer-Verlag, second edition, 1998.

[14] National Information Standards Organization. Information Retrieval (Z39.50): Application Service Definition and Protocol Specification (ANSI/NISO Z39.50. I995). NISO Press, Betheda, MD, USA, 1995. Available online at HREF5.

Hypertext References

HREF1
http://www.informatik.uni-tuebingen.de
HREF2
http://www-ti.informatik.uni-tuebingen.de
HREF3
http://www.uni-tuebingen.de
HREF4
ftp://ftp.cs.virginia.edu/pub/techreports/CS-99-08.ps.Z
HREF5
http://lcweb.loc.gov/z3950/agency/

Footnotes

[*]
The current formulation of the collection selection problem imposes no penalty for retrieving more documents than there are relevant documents in all collections combined. Under such assumptions it makes sense to fetch a document even if its chances of being relevant are tiny, so the original condition $\sum_{i}p_{i}\leq N$ turns into equality.


Copyright

Mikhail Bessonov, © 2004. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.