An Evolutionary Approach for Searching Distributed Collections

Hui Yang, Ph.D student, School of Information Technology and Computer Science, University of Wollongong, Wollongong, 2520. Email: hy92@uow.edu.au

Dr.Minjie Zhang, Senior Lecturer, School of Information Technology and Computer Science, University of Wollongong, Wollongong, 2520, Email: minjie@uow.edu.au

Abstract

This paper proposes an evolutionary approach for searching information collections in a distributed environment.  This approach, which is given a set of information collections and user queries, finds the best possible matching between the information collections and user queries by taking into account the factors from both user queries perspective and collections’ perspective. The contribution of each factor in a searching model is indicated by a weighted parameter derived by a genetic algorithm. The values of these parameters are used to determine the contribution of the corresponding factors to the effectiveness of selection. Some preliminary results are demonstrated. Although inconclusive, they provide directions for possibly fruitful research and applications.

Introduction

The need to search multiple collections in a distributed environment is becoming increasingly important as the sizes of individual collections grow and network information services proliferate (Baeza-Yates and Ribeiro-Neto, 1999). The most important issue in searching a set of distributed collections is how to find the right collections to search for a query. It will generally be too expensive in terms of both computer and communication resources and the user’s time to search every collection in a distributed environment. Furthermore, many collections are proprietary and may charge users for searching them. A better method for timely and economic retrieval is to constrain the scope of searching to those collections which are likely to contain relevant documents for a query.

      Genetic algorithms (GAs) are search algorithms that reflect in a primitive way some of the processes of natural evolution (Goldberg, 1989). GAs provides us with very efficient search mechanisms that can be used in optimization or classification applications. Most optimization paradigms start with a diverse set of points (hyperspace vectors), called a population. A paradigm typically generates a new population with the same number of members each epoch, or generation based upon fitness values. Operators used by the paradigms, such as crossover and mutation, effectively enhance this parallel search capability, allowing the search to directly “tunnel through” from one promising hyperspace region to another. GAs approaches are often attractive for use in optimization problems that involve dynamic systems. This technique has been successfully applied to a wide variety of applications such as job assignment, forecasting, and classification (Cheng, Gen and Tsujimura, 1996; Lee, Lee and Chang 1997; Jan, 1998).

      In this paper, we describe an evolutionary approach for collection selection in a distributed environment. In this approach, a set of factors used to produce the output can be linearly combined with each other. Each factor is weighted by a coefficient according to its level of participation in the searching strategy. In essence, the value of each factor derived by the GAs determined the contribution of the factors and would range between 0 and 100%. A genetic algorithm (GA) is employed to determine the contribution of each factor to the decision of whether or not to search a collection for satisfying the user query.  Our experimental results have shown that the GA based system is effective in producing high-quality searching results for the user queries.

      The remainder of the paper is structured in the following manner. Firstly, Section 2 simply describes the problem of searching distributed collections. Section 3 proposes the basic principle of a matching algorithm. In section 4, we discuss the implementation of genetic algorithm and fitness evaluation. Section 5 contains our preliminary results and analysis. This is followed by conclusion and future work in Section 6.

Problem Description

In a simplest distributed model, users’ search requests are forwarded by a broker to a carefully selected subset F of a long list S of known information collections. All information collections operate a query-processing service restricted to the documents which they serve. The broker endeavors to merge the results obtained from the collections in F into a ranked list which will best meet the user’s need. The main reasons for selecting a subset of S are reduction of network or collection access costs and improvement in the timeliness of result delivery. However it is possible that a highly effective collection selection method might also result in a better combined ranking of documents. The relationship between search effectiveness (precision-recall performance) and cost of searching is of particular interest for distributed information retrieval (DIR).

      Brokers should take care of a lot of factors from both user queries’ perspective and from information collections’ perspective. The factors from user queries’ perspective include the number of relevant documents in search results, the response time for the user’s query collections and the change paid for searching relevant information on some proprietary collections.  The factors from information collections’ point of view include the cost of communication resource used for delivering retrieved results and the cost of computer resource used for query-processing service on collections. Brokers have to consider all these factors while assigning a subset of collections to search for the user’s query. To the best of our knowledge till now genetic algorithms have not been used for solving collection selection problem. This paper describes a genetic algorithm based matching approach to match information collections and user queries.

        The objective function used here is to maximize user queries’ satisfaction and to minimize search cost on information collections.

       Let  be the set of user queries and  be the set of collections. Then the objective function is defined as

                                                   (1)

where the first term corresponds to the total user queries’ satisfaction that measure how  “satisfactory” the users are with the retrieval results; and the latter corresponds to the total cost of the resource consumption on m collections for processing all  n users’ queries.

Matching Algorithm

The matching algorithm used here takes the collections’ cost for user queries and the user queries’ retrieval effectiveness for collections and finds the best possible matching. This matching depends on the preference list of collections for queries. The proposed approach takes the preference list of collections for queries as an input. Then the list of qualified collections for each query is computed and then they are sorted according to a score function, which is described in the next section. A genetic algorithm is used to train this score function. The score function controls the matching algorithm because the ranked list of qualified collections for each query is computed using the score function. The matching algorithm maintains a list of available collections and a list of a query queue. It considers both each collection’s retrieval effectiveness for queries and each collection’s retrieval cost for queries. If one collection has high retrieval effectiveness and low cost for the query, this collection will be searched by the broker for the query. If another collection also has the same preference and is better than that collection for the query, this better collection will top the ranked list of qualified collections. The broker forwards the user query to some top-rank collections on the list of qualified collections as potential “good” information collections to search. Then that query will be deleted from the list of the query queue. The algorithm continues until the list of the query queue is exhausted.

 

Genetic Algorithm Implementation

 

 

 

 

 

 

 

 

 

 

 


Figure 1: A framework of a Collection Selection System by an evolutionary approach

 

This paper proposes an evolutionary approach to determine the best possible matching between collections and user queries. The framework of Collection Selection System by an evolutionary approach, which consists of the stages is shown in Figure 1. The collection selection problem is represented in the early stages of the system as strings. We use a string to represent a chromosome in GA. Each chromosome can be represented a sequence of parameters which are corresponding to genes in GA. The following subsections describe the different stages of the system. System input is the arrays representing the list of available collections and the list of the query queue.

     An initial population is randomly generated and successive sequence populations called generations are derived by applying the selection, crossover and mutation operators to the previous sequence generation. As shown in Figure 1 a current generation is acted upon by the three operators to produce the successive generation.

Parameters

     As described in Section 3, the score function is the main part of the matching algorithm. Currently, there are 5 parameters used in the score function to test the feasibility of the approach. It can be easily extended to any number of parameters so that all the factors both from collection’s and query’s perspective are taken into consideration.

     Computing these parameters’ value is only the first step in the implementation of this genetic algorithm. These parameters and the value of the parameters are explained in more detail below.

1.    Query-Relevance Parameter

This parameter reflects the query relevance of one collection for a particular query, which is denoted by a relevance matrix shown below.

                                             Matrix 1

          be the query relevance of the jth collection for the ith query.

2.    Response-Time Parameter

        This parameter is used to measure the processing time of one collection for the query. For convenience, the value of the parameter is kept in the interval (0,1). The value of this parameter will be smaller when the response time of collection becomes longer.

3.    Charge-Searching-Information Parameter

         To protect the rights of IPR/copyrighted material on some proprietary collections, it is important for collections to charge a fee for retrieving IPR/copyrighted material. This value of this parameter is scaled to the range between 0 and 1, which will decrease as the charge of searching increases.

4.    Cost-Communication-Resource Parameter

        This parameter reflects the traffic between the collection to the user interface. It usually returns a value from 0 to 1. The more congested the traffic, the smaller the value for this parameter.

5.    Cost-Computer-Resource Parameter

         Cost-Computer-Resource Parameter and Response-Time Parameter have a tightly coupled relationship. Changes in the cost are often precursors to changes in the response time of the collection. The more the cost of used computer resource, the longer the response time for the query and the smaller the value for this parameter.  The value of the parameter is defined in [0,1].

 

       λ 1, λ 2, λ 3, λ4 and λ 5 are used respectively to denote these parameter values.

       When a collection receives a query, it rates itself using a score function that combines the scores of the different parameters in a linear fashion. The score function employs the five parameters above and a threshold value:

 

                                                 (2)

where T is the Threshold; is the importance of parameter i on the process of collection selection. We assume that the weights are normalized and the sum of these five weights is 1. The relative weight for the Query-Relevance Parameter is 0.6 since this parameter is more importance for collection selection than other four parameters. So the total of the remaining four parameters have weight 0.4. 

        Each parameter contributes to the value of the score function using its weight times its corresponding parameter value λ. If the score the collection offers is greater than threshold, then this collection will be assigned to a list of qualified collections for that particular query. The order of the qualified collections in turn effects the final matching between collections and user queries because the matching produced by the algorithm depends upon the order of the preference of collections for this query. The multiplication of the weight and the parameter value tells us the contribution of each of these parameters while searching a collection for a particular query.  A genetic algorithm is used to come up with the best set of values for these parameters so that the final matching is the best possible matching and the total satisfaction of user queries is maximized and the search cost on collections is minimized.

        In our algorithm, a GA chromosome consists of an array of 5 real numbers, one for each parameter. A conceptual representation is shown below (Figure 2). The range of these values is from 0 to1.

 

λ 1

λ 2

λ3

λ 4

λ 5

 

Figure 2: Chromosomal representation of the five parameters

Genetic Algorithm steps

The four main steps of this algorithm are introduced in details as follows:

1.    Generation of the First Population

This step represents the search’s starting point and is to complete the population with randomly chosen member.

2.    Selection Process

All GAs use some form of mechanism to decide which individuals from the current population should go into the mating pool (MP) that forms the basis of the next population generation. There are two popular approaches known to work well in such circumstance for implementing selection. The first is the tournament selection (Blickle and Thiele, 1995). The second is the roulette selection (Man, Tang & Kwong, 1999). We used tournament selection in most of the experiments of this approach because it reduces the selection pressure.  Tournament selection works in the following way: k individuals are randomly chosen from the population. The individual with the highest fitness among the selected k is placed in the MP. This process is repeated N times, where N is the size of the population. k is called the tournament size (which is 2 in our case) and it determines the degree to which the best individuals are favored. Once the collections for a particular query have been chosen, the individual with the high value of the score function is selected (). These collections will definitely form part of the new population. The remainder of the individuals in the next population is created by applying crossover and mutation to the rest of the collections in the MP. Thus, the next generation of collections is composed of the best of the individuals of the old population plus a number of newly created strategies.

3.    Crossover Process

The basic concept in crossover is to exchange gene information between chromosomes. An effective design of crossover operation would greatly increase the convergence rate of a problem. We randomly select two individuals from the population. Crossover points, , is then randomly chosen and sorted in an ascending order. Then the genes between successive crossover points are alternately exchanged between the individuals, with a probability. In our experiments, is between 0.6 and 0.8 and the value of c is 1.

4.    Mutation Process

  Because new genetic information will not be generated by the crossover procedure, the mutation procedure becomes an important mechanism to explore new genetic information. The process of mutation randomly alters the genes of the chromosomes and takes on the role of restoring lost genetic material that has not be generated in the population during the initialization procedure and creates individuals for new generations. To prevent the disruption of good individuals due to the random alternation of genes during mutation, mutation is inhibited for some parameters (genes) and the issue-related genes such as Query-Relevance Parameter in our case. This is because these genes effectively determine the collection selection and, therefore, they set the conditions in which a particular hypothesis is to be tested. The remaining genes are given a chance of 0.01 for undergoing mutation.

Fitness Evaluation

The objective function here is to maximize user queries’ satisfaction and to minimize search cost on information collections. So the fitness function should consider both of these factors.

                                                 (3)

Where:

·        m is the total number of user queries;

·        n is the total number of collections for searching;

·        is the fraction of the top r ranked documents that are relevant to the query:

                                                          (4)

Where r is the number of retrieved documents that are relevant, is the total number of retrieved documents;

·        is the total cost of resource consumption on the ith collection for the ith query.

·        is the Maximum possible cost of one collection for any query.

·        is the response time when the user gets the retrieval results for the ith query.

is the Maximum possible time when the user gets the retrieval results for any query.

Preliminary Experimental Results   

Finally, we need to determine if the best subset of collections from the GA represents a good collection selection strategy. We choose a validation period of similar length to the evaluation period and use the GAs best subset of collections for the queries to create retrieval results over that time period. This was compared to a control individual who randomly forwards the queries to the random subset of collections to search relevant information. The implementation of the GA has been done in JDK 1.31 and tested on a machine with 350 MHz Pentium processor and 192 MB RAM. Our testbed was based on the Reuters 21578 Distributed 1.0 data set which is consisted of 21578 articles and 135 topic categories from the Reuters newswire [HREF 1]; The collections in the data set are indexed separatedly to simulate a real-world distributed IR system. All the experiments discussed in this subsection are conducted with crossover rates 0.6-0.8 and a constant mutation rate of 0.01. Tournament selection of size 2 was used.

      Table 1 shows the results of the best subset of collections produced by the GA compared against a control individual. We used a set of 30 collections and 10 queries. As expected, we got some encouraging preliminary experimental results. The GA performed better than the control individual. There is an increase as 24.7% from 0.97 to 1.21 in fitness value. But the performance didn’t reach the most optimization. We plan to continue with original experimental parameters to investigate a more robust selection strategy. The emphasis will be put on the design of the objective fitness function and the score function which are probably too simplistic for such a complex problem.

 

 

Fitness Value

Best GA searching

1.21

Control individual

0.97

 

Table 1: The performance comparison of three different collection selection approaches

Conclusions and Future Work

In this paper, an evolutionary approach for solving the collection selection problem of distributed information retrieval has been presented. The preliminary experimental results exhibit that the genetic-basic collection selection system is capable of providing user queries with effective, timely and economical retrieval results.

      The further work of this research includes to carefully choose the values of parameters, and to improve the performance of this approach. We used synthetic datasets for our current experiments, as we do not have access to real data on Internet at present. Besides, there may be certain aspects of the experimental premise that are weak or incorrect. The five parameters we chose are based on intuition and experience. Further work will develop a complete system from this prototype by including more parameters by considering all possible factors.

Acknowledgement

This research was supported by a large grant from the Australian Research Council under contract DP0211282.

References

Baeza-Yates, R. and Ribeiro-Neto, B. (1999) Modern Information Retrieval. Addison ­Wesley, ACM Press, New York.

 

Blickle, T. and Thiele, L. (1995) “A comparison of selection schemes used in genetic algorithms” in Technical Report v.11, TIK.

 

Cheng, R., Gen, M., Tsujimura, Y. (1996) “A Tutorial Survey of Job ­ Shop Scheduling Problems using Genetic Algorithm-I” in Representation, Computers & Industrial Engineering v.30 n.4 p. 983-997.

 

Goldberg, D.E. (1989) Genetic Algorithms in Search, Optimization & Machine Learning. Addison ­Wesley, Readings, MA.

 

Lee, D. G, Lee, B W., and Chang, S. H. (1997) “Genetic Programming Model for Long-term Forecasting of Electric Power Demand” in Electric Power Systems Research v.40, p. 17-22.

 

Jan, J. F. (1998) “Genetic Programming for Classification of Remote Sensing Data in Taiwan Journal of Forest Science v.13 n.2 p. 109-118.

 

Wan, K.F., Tang, K.S. and Kwong, S. (1999) Genetic Algorithms Concepts and Designs. Springer, London.

Hypertext References

HREF1

Lewis, D. D. Reuter-21578 Text Categorization Test Collection Distribution 10. Available at http://www.research.att.com/~lewis.

Copyright

Andrew Treloar, © 2000. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive license to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive license to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.