Hui Yang, Ph.D student, School of Information Technology and Computer Science, University of Wollongong, Wollongong, 2520. Email: hy92@uow.edu.au
Dr.Minjie Zhang, Senior Lecturer, School of Information Technology and Computer Science, University of Wollongong, Wollongong, 2520, Email: minjie@uow.edu.au
This paper proposes an evolutionary approach for searching information collections in a distributed environment. This approach, which is given a set of information collections and user queries, finds the best possible matching between the information collections and user queries by taking into account the factors from both user queries perspective and collections’ perspective. The contribution of each factor in a searching model is indicated by a weighted parameter derived by a genetic algorithm. The values of these parameters are used to determine the contribution of the corresponding factors to the effectiveness of selection. Some preliminary results are demonstrated. Although inconclusive, they provide directions for possibly fruitful research and applications.
The need to search multiple collections in a distributed
environment is becoming increasingly important as the sizes of individual
collections grow and network information services proliferate (Baeza-Yates and
Ribeiro-Neto, 1999). The most important issue in searching a set of distributed
collections is how to find the right collections to search for a query. It will
generally be too expensive in terms of both computer and communication
resources and the user’s time to search every collection in a distributed
environment. Furthermore, many collections are proprietary and may charge users
for searching them. A better method for timely and economic retrieval is to
constrain the scope of searching to those collections which are likely to
contain relevant documents for a query.
Genetic
algorithms (GAs) are search algorithms that reflect in a primitive way some of
the processes of natural evolution (Goldberg, 1989). GAs provides us with very
efficient search mechanisms that can be used in optimization or classification
applications. Most optimization paradigms start with a diverse set of points
(hyperspace vectors), called a population.
A paradigm typically generates a new population with the same number of members
each epoch, or generation based upon
fitness values. Operators used by the paradigms, such as crossover and
mutation, effectively enhance this parallel search capability, allowing the
search to directly “tunnel through” from one promising hyperspace region to
another. GAs approaches are often attractive for use in optimization problems
that involve dynamic systems. This technique has been successfully applied to a
wide variety of applications such as job assignment, forecasting, and
classification (Cheng, Gen and Tsujimura, 1996; Lee, Lee and Chang 1997; Jan,
1998).
In this
paper, we describe an evolutionary approach for collection selection in a
distributed environment. In this approach, a set of factors used to produce the
output can be linearly combined with each other. Each factor is weighted by a
coefficient according to its level of participation in the searching strategy.
In essence, the value of each factor derived by the GAs determined the
contribution of the factors and would range between 0 and 100%. A genetic
algorithm (GA) is employed to determine the contribution of each factor to the
decision of whether or not to search a collection for satisfying the user
query. Our experimental results have
shown that the GA based system is effective in producing high-quality searching
results for the user queries.
The remainder
of the paper is structured in the following manner. Firstly, Section 2 simply
describes the problem of searching distributed collections. Section 3 proposes
the basic principle of a matching algorithm. In section 4, we discuss the
implementation of genetic algorithm and fitness evaluation. Section 5 contains
our preliminary results and analysis. This is followed by conclusion and future
work in Section 6.
In a simplest distributed model, users’ search requests
are forwarded by a broker to a
carefully selected subset F of a long list S
of known information collections. All
information collections operate a query-processing service restricted to the
documents which they serve. The broker endeavors to merge the results obtained
from the collections in F into a ranked list which will best
meet the user’s need. The main reasons for selecting a subset of S
are reduction of network or collection access costs and improvement in the
timeliness of result delivery. However it is possible that a highly effective
collection selection method might also result in a better combined ranking of
documents. The relationship between search effectiveness (precision-recall
performance) and cost of searching is of particular interest for distributed
information retrieval (DIR).
Brokers
should take care of a lot of factors from both user queries’ perspective and
from information collections’ perspective. The factors from user queries’
perspective include the number of relevant documents in search results, the
response time for the user’s query collections and the change paid for
searching relevant information on some proprietary collections. The factors from information collections’
point of view include the cost of communication resource used for delivering
retrieved results and the cost of computer resource used for query-processing
service on collections. Brokers have to consider all these factors while
assigning a subset of collections to search for the user’s query. To the best
of our knowledge till now genetic algorithms have not been used for solving
collection selection problem. This paper describes a genetic algorithm based
matching approach to match information collections and user queries.
The objective
function used here is to maximize user queries’ satisfaction and to minimize
search cost on information collections.
Let
be
the set of user queries and
be
the set of collections. Then the objective function is defined as
(1)
where the first term corresponds to the total user queries’ satisfaction that measure how “satisfactory” the users are with the retrieval results; and the latter corresponds to the total cost of the resource consumption on m collections for processing all n users’ queries.
The matching algorithm used here takes the collections’
cost for user queries and the user queries’ retrieval effectiveness for
collections and finds the best possible matching. This matching depends on the
preference list of collections for queries. The proposed approach takes the
preference list of collections for queries as an input. Then the list of
qualified collections for each query is computed and then they are sorted according
to a score function, which is described in the next section. A genetic
algorithm is used to train this score function. The score function controls the
matching algorithm because the ranked list of qualified collections for each
query is computed using the score function. The matching algorithm maintains a
list of available collections and a list of a query queue. It considers both
each collection’s retrieval effectiveness for queries and each collection’s
retrieval cost for queries. If one collection has high retrieval effectiveness
and low cost for the query, this collection will be searched by the broker for
the query. If another collection also has the same preference and is better
than that collection for the query, this better collection will top the ranked
list of qualified collections. The broker forwards the user query to some
top-rank collections on the list of qualified collections as potential “good”
information collections to search. Then that query will be deleted from the
list of the query queue. The algorithm continues until the list of the query
queue is exhausted.
![]() |
Figure
1: A framework of a Collection Selection System by an evolutionary approach
This paper proposes an evolutionary approach to determine
the best possible matching between collections and user queries. The framework
of Collection Selection System by an evolutionary approach, which consists of
the stages is shown in Figure 1. The collection selection problem is
represented in the early stages of the system as strings. We use a string to
represent a chromosome in GA. Each chromosome can be represented a sequence of
parameters which are corresponding to genes in GA. The following subsections
describe the different stages of the system. System input is the arrays
representing the list of available collections and the list of the query queue.
An initial
population is randomly generated and successive sequence populations called
generations are derived by applying the selection, crossover and mutation
operators to the previous sequence generation. As shown in Figure 1 a current
generation is acted upon by the three operators to produce the successive
generation.
As described
in Section 3, the score function is the main part of the matching algorithm.
Currently, there are 5 parameters used in the score function to test the
feasibility of the approach. It can be easily extended to any number of
parameters so that all the factors both from collection’s and query’s
perspective are taken into consideration.
Computing
these parameters’ value is only the first step in the implementation of this
genetic algorithm. These parameters and the value of the parameters are
explained in more detail below.
1. Query-Relevance Parameter
This parameter reflects the
query relevance of one collection for a particular query, which is denoted by a
relevance matrix shown below.
Matrix 1
be
the query relevance of the jth collection for the ith query.
2.
Response-Time Parameter
This parameter is used to measure the processing time of
one collection for the query. For convenience, the value of the parameter is
kept in the interval (0,1). The value of this parameter will be smaller when
the response time of collection becomes longer.
3. Charge-Searching-Information
Parameter
To
protect the rights of IPR/copyrighted material on some proprietary collections,
it is important for collections to charge a fee for retrieving IPR/copyrighted
material. This value of this parameter is scaled to the range between 0 and 1,
which will decrease as the charge of searching increases.
4.
Cost-Communication-Resource
Parameter
This parameter reflects the traffic
between the collection to the user interface. It usually returns a value from 0
to 1. The more congested the traffic, the smaller the value for this parameter.
5.
Cost-Computer-Resource
Parameter
Cost-Computer-Resource
Parameter and Response-Time Parameter
have a tightly coupled relationship. Changes in the cost are often precursors
to changes in the response time of the collection. The more the cost of used
computer resource, the longer the response time for the query and the smaller
the value for this parameter. The value
of the parameter is defined in [0,1].
λ 1, λ 2, λ 3, λ4
and λ 5 are used respectively to denote these parameter values.
When a collection receives a query, it
rates itself using a score function that combines the scores of the different
parameters in a linear fashion. The score function employs the five parameters
above and a threshold value:
(2)
where T
is the Threshold;
is the importance
of parameter i on the process of collection selection. We assume that the weights
are normalized and the sum of these five weights is 1. The relative weight
for the Query-Relevance Parameter
is 0.6 since this parameter is more importance for collection selection than
other four parameters. So the total of the remaining four parameters have
weight 0.4.
Each parameter contributes to the
value of the score function using its weight times its corresponding parameter
value λ. If the score the collection offers is greater than threshold,
then this collection will be assigned to a list of qualified collections for
that particular query. The order of the qualified collections in turn effects
the final matching between collections and user queries because the matching
produced by the algorithm depends upon the order of the preference of
collections for this query. The multiplication of the weight and the parameter
value tells us the contribution of each of these parameters while searching a
collection for a particular query. A
genetic algorithm is used to come up with the best set of values for these
parameters so that the final matching is the best possible matching and the
total satisfaction of user queries is maximized and the search cost on
collections is minimized.
In our algorithm, a GA chromosome
consists of an array of 5 real numbers, one for each parameter. A conceptual
representation is shown below (Figure 2). The range of these values is from 0
to1.
|
λ 1 |
λ 2 |
λ3 |
λ 4 |
λ 5 |
Figure 2: Chromosomal representation of the five
parameters
The four main steps of
this algorithm are introduced in details as follows:
1.
Generation of the First Population
This step represents the
search’s starting point and is to complete the population with randomly chosen
member.
2.
Selection Process
All
GAs use some form of mechanism to decide which individuals from the current
population should go into the mating pool (MP) that forms the basis of the
next population generation. There are two popular approaches known to work
well in such circumstance for implementing selection. The first is the tournament
selection (Blickle and Thiele, 1995). The second is the roulette selection
(Man, Tang & Kwong, 1999). We used tournament selection in most of the
experiments of this approach because it reduces the selection pressure. Tournament selection works in the following
way: k individuals are randomly
chosen from the population. The individual with the highest fitness among
the selected k is placed in the
MP. This process is repeated N times,
where N is the size of the population. k is called the tournament size (which
is 2 in our case) and it determines the degree to which the best individuals
are favored. Once the collections for a particular query have been chosen,
the individual with the high value of the score function is selected (
).
These collections will definitely form part of the new population. The remainder
of the individuals in the next population is created by applying crossover
and mutation to the rest of the collections in the MP. Thus, the next generation
of collections is composed of the best of the individuals of the old population
plus a number of newly created strategies.
3. Crossover
Process
The
basic concept in crossover is to exchange gene information between chromosomes.
An effective design of crossover operation would greatly increase the convergence
rate of a problem. We randomly select two individuals from the population.
Crossover points,
,
is then randomly chosen and sorted in an ascending order. Then the genes between
successive crossover points are alternately exchanged between the individuals,
with a probability
. In our experiments,
is between 0.6 and 0.8 and the value of c
is 1.
4.
Mutation Process
Because new genetic information will not be
generated by the crossover procedure, the mutation procedure becomes an
important mechanism to explore new genetic information. The process of mutation
randomly alters the genes of the chromosomes and takes on the role of restoring
lost genetic material that has not be generated in the population during the
initialization procedure and creates individuals for new generations. To prevent
the disruption of good individuals due to the random alternation of genes
during mutation, mutation is inhibited for some parameters (genes) and the
issue-related genes such as Query-Relevance
Parameter in our case. This is because these genes effectively determine
the collection selection and, therefore, they set the conditions in which a
particular hypothesis is to be tested. The remaining genes are given a chance
of 0.01 for undergoing mutation.
The objective function here is to maximize user queries’
satisfaction and to minimize search cost on information collections. So the
fitness function should consider both of these factors.
(3)
Where:
·
m is the total number of user queries;
·
n is the total number of collections for searching;
·
is the fraction of the top r ranked documents that are relevant to the query
:
(4)
Where r is the number of retrieved documents
that are relevant,
is the total
number of retrieved documents;
·
is
the total cost of resource consumption on the ith collection for the ith query.
·
is the
Maximum possible cost of one collection for any query.
·
is the
response time when the user gets the retrieval results for the ith
query.
is
the Maximum possible time when the user gets the retrieval results for any
query.
Finally, we need to determine if the best subset of collections
from the GA represents a good collection selection strategy. We choose a
validation period of similar length to the evaluation period and use the GAs
best subset of collections for the queries to create retrieval results over
that time period. This was compared to a control individual who randomly
forwards the queries to the random subset of collections to search relevant
information. The implementation of the GA has been done in JDK 1.31 and tested
on a machine with 350 MHz Pentium processor and 192 MB RAM. Our testbed was
based on the Reuters 21578 Distributed 1.0 data set which is consisted of 21578
articles and 135 topic categories from the Reuters newswire [HREF 1]; The collections in the
data set are indexed separatedly to simulate a real-world distributed IR
system. All the experiments discussed in this subsection are conducted with
crossover rates 0.6-0.8 and a constant mutation rate of 0.01. Tournament
selection of size 2 was used.
Table 1 shows
the results of the best subset of collections produced by the GA compared
against a control individual. We used a set of 30 collections and 10 queries.
As expected, we got some encouraging preliminary experimental results. The GA
performed better than the control individual. There is an increase as 24.7%
from 0.97 to 1.21 in fitness value. But the performance didn’t reach the most
optimization. We plan to continue with original experimental parameters to
investigate a more robust selection strategy. The emphasis will be put on the
design of the objective fitness function and the score function which are
probably too simplistic for such a complex problem.
|
|
Fitness Value |
|
Best
GA searching |
1.21 |
|
Control
individual |
0.97 |
Table 1:
The performance comparison of three different collection selection approaches
In this paper, an evolutionary approach for solving the
collection selection problem of distributed information retrieval has been
presented. The preliminary experimental results exhibit that the genetic-basic
collection selection system is capable of providing user queries with
effective, timely and economical retrieval results.
The further
work of this research includes to carefully choose the values of parameters,
and to improve the performance of this approach. We used synthetic datasets for
our current experiments, as we do not have access to real data on Internet at
present. Besides, there may be certain aspects of the experimental premise that
are weak or incorrect. The five parameters we chose are based on intuition and
experience. Further work will develop a complete system from this prototype by
including more parameters by considering all possible factors.
This research was supported by a large grant from the
Australian Research Council under contract DP0211282.
Baeza-Yates,
R. and Ribeiro-Neto, B. (1999) Modern Information Retrieval. Addison
Wesley, ACM Press, New York.
Blickle,
T. and Thiele, L. (1995) “A comparison of selection schemes used in genetic
algorithms” in Technical Report v.11, TIK.
Cheng, R., Gen, M., Tsujimura, Y. (1996) “A Tutorial Survey of Job Shop Scheduling Problems using Genetic Algorithm-I” in Representation, Computers & Industrial Engineering v.30 n.4 p. 983-997.
Goldberg,
D.E. (1989) Genetic Algorithms in Search, Optimization & Machine
Learning. Addison Wesley, Readings, MA.
Lee,
D. G, Lee, B W., and Chang, S. H. (1997) “Genetic Programming Model for
Long-term Forecasting of Electric Power Demand” in Electric Power Systems
Research v.40, p. 17-22.
Jan, J. F. (1998) “Genetic Programming for Classification of Remote Sensing Data” in Taiwan Journal of Forest Science v.13 n.2 p. 109-118.
Wan,
K.F., Tang, K.S. and Kwong, S. (1999) Genetic Algorithms Concepts and
Designs. Springer, London.
HREF1
Lewis,
D. D. Reuter-21578 Text Categorization Test Collection Distribution 10.
Available at http://www.research.att.com/~lewis.
Andrew Treloar, © 2000. The authors assign to
Southern Cross University and other educational and non-profit institutions a
non-exclusive license to use this document for personal use and in courses of instruction
provided that the article is used in full and this copyright statement is
reproduced. The authors also grant a non-exclusive license to Southern Cross
University to publish this document in full on the World Wide Web and on CD-ROM
and in printed form with the conference papers and for the document to be
published on mirrors on the World Wide Web.