David Lowe [HREF1], Faculty of Engineering, University of Technology, Sydney, P.O. Box 123, Broadway, Sydney, 2007, Australia. david.lowe@uts.edu.au
Search engines provide an important mechanism for dealing with the exploding volume of information on the Web. Numerous sophisticated algorithms have been developed to improve both the performance of search engines and the quality of the results achieved (as measured by both the precision of the results and the extent of recall). However, the lack of effective consideration of the context of indexed pages limits the relevance of the results obtained.
Our work supports the hypothesis that the relevance of search results can be improved through the inclusion of contextual information. This contextual information can be obtained by an automated analysis of the local information space surrounding a given candidate page - with the local space defined by the navigational structures inherent in the Web. In particular, we treat the content of linked pages as providing a useful context for a source page, which can consequently be analysed and used to improve searching precision. The research described in this paper involves the development of a prototype search engine that uses this technique. We describe current approaches to searching, and the theory behind our approach. We then look at a prototype search engine that implements these concepts, and the subsequent evaluation of the improvement in information retrieval. The significance of this work includes a greatly improved understanding of context-based searching and a substantially enhanced Web searching capability.
The Web contains a phenomenal volume of information that is both rapidly expanding and extremely diverse. Search engines provide one of the key mechanisms enabling users to locate information within this vast information space. However, current search engines are typically unable to effectively filter the content in a way that provides a high level of precision combined with high recall. The results returned by a search are typically a mix of relevant and irrelevant content.
To illustrate the problems being faced, consider the following case study (which will be revisited throughout the paper).
Illustrative Case Study: A recent research study of Year 7 students, given the task of finding out on the Net the answer to the question "How do birds sing?", showed a number of key dilemmas. The problem was not just the enormous volume of hits when using subject search terms such as "birds", "birds singing" and "singing birds"; the task created fundamental end-user searching dilemmas: dealing with the large number of items recalled; creating appropriate search terms and strings; understanding the appropriateness of the various search engines; making relevance judgements; judging the quality of relevant documents; and managing the search process (navigation, productivity, time management, organisation of information). Is this the future of user interaction with the Net? Todd, R. "Negotiating the Web: Language, critical literacies and learning". In: D. Gibbs & K-L. Krause (Eds), Cyberlines: Languages and cultures of the Internet. Melbourne: James Nicholas, 1999.
Research on information contextualisation has shown that the structure implicit in Web content (i.e. the interlinked pages) can be interpreted as providing a context which is important in understanding the relevance of the content on a specific page. Using this as a basis, this paper describes research into improving the retrieval precision of search engines through the inclusion of contextual information derived from an automatic analysis of the information structure (implied by the navigational links) around candidate web pages. The result is a modified approach to searching the Web that potentially provides substantially improved search precision.
In developing an approach to searching based on an automated understanding of information context, there are a number of key aspects that need to be considered; these are formulated as a series of questions later in the paper.
This paper describes the initial steps in addressing these key issues. We begin the paper by discussing the background to this work, considering approaches to searching and looking at some of the underlying flaws. We use this discussion to establish the key hypothesis of this research. We then describe a research prototype that illustrates our proposed approach using the Web structure to contextualise searches. We consider the results of using the prototype, and evaluate and quantify the effect on search precision, in terms of result relevance, obtained through contextualisation. In particular we compare the results against conventional search tools. We finish by looking at the significance of the approach, future work that is required, and some final conclusions about the approach described.
The volume of information on the Web continues to swell at a phenomenal rate. Estimates for the size of the Web vary, but the lower limits have increased from upwards of 100 million pages in early 1997 to upwards of 800 million pages in mid-1999 (HREF2). A major research issue within numerous disciplines is how we find specific information within this growing information space (Filman and Pant 1998, McNicholas & Todd 1996). Web searching - the primary approach to locating content on the Web - is one of the largest Internet-related industries (Lawrence and Giles 1998). There are two primary approaches that have been adopted by search engine vendors - directories and automated indexing. Directories (such as Yahoo (HREF3) and Magellan (HREF4)) are manually (or at best semi-automatically) maintained classifications of content. This manual classification provides an ability to obtain very good precision and some degree of content validation, but given the huge amount of content, directories are unable to achieve effective coverage of the Web. In other words, for any given search specification, a directory-based search engine will have a relatively small upper limit on the proportion of the possible relevant documents which can be retrieved - though those that are retrieved are likely to be relevant (and of higher quality, given that they have been manually selected). This is illustrated in Figure 1.

Figure 1: Precision and recall for Web directories vs automated indexing
An alternative approach is to use intensive automation - Web spiders / worms / crawlers etc. that trawl the Web and index as high a proportion of content as possible. Search engines such as AltaVista (HREF5) and Lycos (HREF6) use this approach, providing a user interface to access the index. This approach can potentially result in a much greater proportion of the Web being indexed (though admittedly, this proportion is still small - estimated at the time of writing at around 20% for Northern Light, 18.8% for AltaVista, 13.8% for Inktomi, and 6.3% for Lycos, for example). This increased proportion results in a much higher proportion of all possible relevant documents being recalled. The lack of effective classification can however result in much poorer precision - i.e. we tend to get a much higher proportion of irrelevant documents, or a lower "signal-to-noise" ratio. (As an interesting aside, some researchers have started looking at the consequences of "serendipitous" searching, where users gain greater satisfaction because they are seeing a higher proportion of "irrelevant" information.)
Note: different disciplines tend to use the term "recall" in rather different ways. In this paper, we use the term to refer to the proportion of the relevant documents (i.e. those that the user would find useful) that have actually been retrieved. In other words, the optimum would be 100% recall - the user finds all the documents that are relevant. This usage is inconsistent with the usage of the term in some domains.
Ideally, we would like an approach to searching which maintains a high level of recall of relevant documents (which given the quantity of information on the Web effectively requires automated analysis of content) whilst optimising the precision of the search. Various approaches to improving search precision have been investigated. The simplest of these is to use meta-search techniques (Selberg & Etzioni 1995). In this approach the meta-search engine passes the search query to multiple search engines and then compiles the results to obtain an increase in both recall (since different engines will have indexed different parts of the Web) and precision (since we can use correlation between the results). For this approach to work we need to be able to convert a search query into the form required for the various search engines and then analyse the results in order to merge them. We are also still constrained by the inherent limitations of the various search engines.
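To make the idea concrete, the following Python sketch shows one way a meta-search wrapper might merge result lists and reward documents returned by more than one engine. The engine stubs, URLs and scoring scheme are illustrative assumptions only, not the MetaCrawler algorithm.

```python
def meta_search(query, engines):
    """Merge results from several engines, boosting documents that more
    than one engine returns (a crude use of result correlation)."""
    scores = {}
    for engine in engines:
        for rank, url in enumerate(engine(query)):
            # Higher-ranked positions contribute more; repeated hits accumulate.
            scores[url] = scores.get(url, 0.0) + 1.0 / (rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative stand-ins for real search engine back-ends.
def engine_a(query): return ["http://a.example/1", "http://b.example/2"]
def engine_b(query): return ["http://b.example/2", "http://c.example/3"]

print(meta_search("birds singing", [engine_a, engine_b]))
```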
Various other techniques try to improve the underlying approach, rather than simply merging the results from multiple search engines. To achieve this we need to develop a clearer understanding of how indexed pages are related back to a specific search query. The simplest and most common search engines simply match the query to content maintained in the index (using approaches such as the Vector Space Model (Salton 1971)) and then rank the results using various ranking algorithms (Harman 1992). This however limits the achieved precision by constraining the matching to only specific indexed content. There are several more sophisticated ways in which the search can be improved. The first is to dynamically adapt the search algorithm in some way. As an example, various researchers have looked at ways of tracking users' browsing patterns and adapting the search algorithms based on navigation history. In other cases (such as MetaSeek (Benitez et al 1998)) users' feedback on search performance is used in the adaptation process. In these cases however it is the users' behaviour that is being used to improve the search, and hence the improvement is user specific.
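As a rough illustration of this baseline index-matching step, the sketch below uses a generic term-frequency variant of the Vector Space Model with cosine similarity; it is not the algorithm of any particular engine, and the tiny "index" is purely illustrative.

```python
from collections import Counter
from math import sqrt

def vector(text):
    """Term-frequency vector for a piece of text."""
    return Counter(text.lower().split())

def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between the query and document term vectors."""
    shared = set(query_vec) & set(doc_vec)
    dot = sum(query_vec[t] * doc_vec[t] for t in shared)
    norm = sqrt(sum(v * v for v in query_vec.values())) * \
           sqrt(sum(v * v for v in doc_vec.values()))
    return dot / norm if norm else 0.0

# Rank indexed documents purely on their match to the query terms.
index = {
    "page1": "birds singing in the morning",
    "page2": "singing lessons for beginners",
}
query = vector("birds singing")
ranked = sorted(index, key=lambda p: cosine_similarity(query, vector(index[p])),
                reverse=True)
print(ranked)
```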
An alternative approach is to improve precision by understanding the context of the possible search results. This allows us to remove (or lower the ranking of) results that have relevant content, but an inappropriate context. As a simple example, Lawrence and Giles (Lawrence & Giles 1998) developed the NECI meta-search engine that analyses the content of each candidate page (obtained from a conventional Web search engine) to determine its relevance to the search query. In other words, this approach uses the entire content of the candidate page in determining its relevance, rather than just the indexed terms.
Given the granularity of the information on the Web, using just the content of pages matching the search query to evaluate context can be very restrictive - in many cases content is broken into numerous small documents which are interlinked, and other documents are referenced which provide additional information. Work by Li (Li 1998) has looked at using the Web link structures (and the content of the link anchors) to modify document relevance rankings. This is based on the assumption that the greater the number of links to a document the higher the potential quality, and therefore relevance, of the document. This approach, though heading in a useful direction, is still limited in that it is highly dependent upon good authoring of link anchor text.
Based on the above discussions, and our own prior work on information contextualisation (Lowe and Bucknell 1997), we are proposing the following hypothesis:
Hypothesis: The content and structure of the information space surrounding documents is an important contextualising factor, which can be utilised by search engines in improving the accuracy of the relevance ranking assigned to documents.
In other words, an automated analysis not only of the content of candidate documents, but also the content of related (i.e. either directly or indirectly linked) documents and the structure of the relationships can be used to improve the effectiveness of search engines (Ellis 1996).
Illustrative Case Study (continued): To illustrate the use of linked documents to provide a context, consider a set of Web pages that provide information on birds (including information on birds singing). If we wished to search for information on "birds singing" then we could type that in as a search query. However, the specific page(s) which provide the desired information may mention "singing" but not the word "birds" - and so we would not retrieve the information. Alternatively we could use just the word "singing", but we would then retrieve (potentially vast) amounts of information on singing totally unrelated to birds. Our approach allows us to search for documents which contain the word "singing", but are associated with document(s) containing the word "birds". This is shown in Figure 2 below.
Figure 2: Utilising related pages to provide a context
Based on the above observations, we have developed an alternative approach to searching for Web content that achieves improved search result relevance. This is achieved through the inclusion of contextual information derived from an automatic analysis of the information structure (implied by the navigational links) around candidate web pages. The scope of this research covers two core elements. The first is the development of an informational model that captures the relationship between a document's context and the structure and content of linked documents. The second is the evaluation of the improvement in searching achieved through the application of this model to a specific search engine prototype.
In developing this approach, we can identify a number of key issues that must be addressed. We have formulated these issues as a series of questions. For each question, we describe a strategy for developing a solution along with our current research progress.
Question: In what way can the content and structure of the local information space provide a context for Web documents? In order to utilise the local structure (where 'local' is defined by linking, not existence on the same server) and content of the documents within this local information space to improve searching relevance, we need to understand how these elements provide a context for a Web document. This needs to be captured as a model that can be used in the development of improved approaches to searching.
Strategy: In order to construct an effective model of the contextual relationship between a Web document and the structure and content surrounding that document, we need to develop an understanding of the factors that we are able to evaluate, and the significance of these factors. We need to consider elements such as the relative locations of search terms, context terms, and link anchors and destinations, the text of the anchors that link the documents, etc. For example, consider the situation where we have a document containing the title "Birds". Further down in this document we have a link with anchor text "Noises", and the destination of the link is a document with the title "Avian Singing". The text "Birds" provides a context for the text "Avian Singing". We can use existing information theory research on contextualisation to develop a model that applies to the Web.
At present we have developed a very simple model. We search for target pages using a conventional search engine, and then re-rank these target pages based on a contextual model. The current model utilises a series of context terms which must be present either in the target page itself, or in one of the pages to which the target page is linked. We modify the ranking further, based on the syntactic distance between the identified search term and the context term(s). The ranking adjustment is performed as follows:
ranking' = ranking * F1 * F2 * ...

where:

ranking' = the adjusted ranking
ranking  = the original ranking
Fi       = the adjustment factor for context term i
         = (1.0 - x)          if context term i is missing
         = (1.0 + x)^(y/sd)   if context term i is present
sd       = the syntactic distance between the search term and context term i
x, y     = empirically determined constants
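As a concrete illustration, the following Python sketch applies this adjustment to a single candidate page. The constant values, and the reading of the present-term factor as (1.0 + x) raised to the power y/sd, are assumptions made for illustration only; they are not the values or code used in the prototype.

```python
# Sketch of the context-based ranking adjustment described above.
# X and Y stand in for the empirically determined constants x and y;
# the values here are placeholders, not the ones used in the prototype.
X = 0.3
Y = 2.0

def adjustment_factor(syntactic_distance):
    """Factor Fi for one context term.

    syntactic_distance is None when the context term was not found in the
    candidate page or in any page linked from it; otherwise it is >= 1.
    """
    if syntactic_distance is None:
        return 1.0 - X                                 # term missing: penalise
    return (1.0 + X) ** (Y / syntactic_distance)       # term present: boost

def adjusted_ranking(ranking, distances):
    """Apply ranking' = ranking * F1 * F2 * ... over all context terms."""
    for sd in distances:
        ranking *= adjustment_factor(sd)
    return ranking

# Example: original ranking 0.70, first context term found at syntactic
# distance 4, second context term not found at all.
print(adjusted_ranking(0.70, [4, None]))
```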
It is important to note that this is a highly simplistic model (used at this stage only as a basis for demonstrating the concepts). As will be discussed in the conclusions, a major area for further research is to refine this model and place it on a more substantial theoretical footing.
Question: How can the context model be used to guide the development of approaches to searching? The understanding of context encapsulated in the model will only be useful if it can be used to guide the creation of practical approaches to searching on the Web. The searching algorithms need to be able to extract, analyse and then utilise the local structure and content of related pages in evaluating context.
Strategy: We have investigated the relationship of the (initially rather simple) context model to the current implementation of the Web, and used this to develop algorithms for obtaining and utilising contextual information. This covers two key elements. The first is the development of algorithms that map the context model into the information structures supported by the Web. For example, we will need to consider how different elements of Web content (link structures, headings, titles, etc.) are utilised. The second element is to adapt the approach to the technical limitations of the Web. For example, the unidirectional nature of linking in the Web (at least until XLink becomes widely adopted) will mean that identifying pages which link to a selected page is difficult. Similarly, access speeds and server response times mean that efficiency, and progressive display of results as the context is analysed, are important. The result of this stage will be a set of algorithms.
We have developed a set of algorithms (that have been encapsulated into the prototype to be discussed shortly). Essentially the user provides a set of both search terms and context terms. We then utilise an existing search engine to obtain a series of candidate pages (note that this inherently limits the recall of our approach to the recall achievable using conventional search engines - what we can do however is subsequently improve the precision).
For each candidate page we parse the content, and identify any linked documents. These linked documents are retrieved and analysed (along with the original document) to identify the existence and location of any context terms. As context terms are found the syntactic distance is calculated (the distance across a link is treated as a constant) and the target page ranking adjusted accordingly. To improve performance and feedback to the user, the results are displayed progressively to the user.
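A minimal sketch of this per-candidate analysis is given below. The prototype itself was implemented in Perl; the helper names, the word-offset measure of syntactic distance and the fixed link-crossing distance are all illustrative assumptions (terms are assumed to be supplied in lower case). The distance returned for each context term would feed the ranking adjustment described earlier.

```python
import re
import urllib.parse
import urllib.request

LINK_DISTANCE = 10   # assumed fixed "cost" of crossing a link (treated as a constant)

def fetch(url):
    """Retrieve a page as lower-cased text; failures simply yield an empty page."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="ignore").lower()
    except OSError:
        return ""

def extract_links(html, base_url):
    """Very rough anchor extraction; a real implementation would use an HTML parser."""
    return [urllib.parse.urljoin(base_url, href)
            for href in re.findall(r'href="([^"#]+)"', html)]

def syntactic_distance(text, search_term, context_term):
    """Smallest word-offset between occurrences of the two terms, or None."""
    words = re.findall(r"\w+", text)
    search_pos = [i for i, w in enumerate(words) if w == search_term]
    context_pos = [i for i, w in enumerate(words) if w == context_term]
    if not search_pos or not context_pos:
        return None
    return min(abs(s - c) for s in search_pos for c in context_pos)

def context_term_distance(candidate_url, search_term, context_term):
    """Distance for one context term: the candidate page first, then linked pages."""
    page = fetch(candidate_url)
    sd = syntactic_distance(page, search_term, context_term)
    if sd is not None:
        return sd
    for link in extract_links(page, candidate_url):
        if context_term in fetch(link):
            return LINK_DISTANCE          # found across a link: constant distance
    return None                           # context term not found anywhere
```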
Question: How can the searching algorithms be implemented in the existing technological framework of the Web? The approach being proposed is only appropriate if it can be implemented within the constraints of existing Web technologies.
Strategy: In order to evaluate the impact of using context in adjusting search engine rankings, we have developed a prototype contextualising search engine. This engine implements the search algorithms that have been developed. The architecture of the search engine is shown in Figure 3. In this architecture, the prototype makes use of existing search engines to generate candidate search results, but then filters these results (or adapts their relevance ranking) based on contextual information retrieved from the Web. Figure 4 shows a screen dump of the research prototype.

Figure 3: Prototype search engine architecture

Figure 4: Screendump of prototype contextualising search engine
The prototype allows users to enter both a conventional search query and a set of contextual terms into a Java applet. This applet then communicates with a conventional Web server - calling a CGI program written in Perl. The Perl program generates a set of candidate documents by passing the query to an existing mainstream search engine (the specific engine is configurable - and can be selected by the user in the Java applet). The Perl program passes the results back to the applet for initial display and then commences analysis of the candidates. For each candidate it retrieves the document, and then extracts any links. The linked documents are then retrieved and each document is searched for the existence of the context terms. Where a context term is found, the appropriate information is passed to the Java applet, which then adjusts the displayed rankings. The approach developed for the prototype is highly simplistic, but has nevertheless provided initial results indicating the potential for substantially improved relevance of search results.
As an aside, it is also worth noting that we are currently experimenting with alternative algorithms in order to improve performance. For example, rather than retrieving all documents and analysing them, we simply use the search terms to generate an initial list of candidates and then use the context terms to generate a second list. The original candidates are then adjusted based on potential relationships to the context candidates (e.g. existence on the same site). Work on these alternatives is however still at too early a stage to report on results.
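A rough sketch of this alternative is shown below, under the purely illustrative assumption that a "potential relationship" is approximated by two results sharing a host; this is not the algorithm used in the prototype, and the boost factor is a placeholder.

```python
from urllib.parse import urlparse

SAME_SITE_BOOST = 1.2   # placeholder boost factor, not an empirically tuned value

def boost_by_context_candidates(search_results, context_results):
    """Raise the ranking of search-term results that share a host with any
    result returned for the context terms alone.

    Both arguments are lists of (url, ranking) pairs."""
    context_hosts = {urlparse(url).netloc for url, _ in context_results}
    adjusted = []
    for url, ranking in search_results:
        if urlparse(url).netloc in context_hosts:
            ranking *= SAME_SITE_BOOST
        adjusted.append((url, ranking))
    return sorted(adjusted, key=lambda item: item[1], reverse=True)
```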
Question: To what extent does the proposed approach improve the relevance of the results being achieved? In order to evaluate the usefulness of the proposed approach we need to evaluate it against conventional search tools. Most particularly, we are interested in the potential improvements that can be achieved in searching relevance. This evaluation also needs to take into account issues such as performance considerations, possible loss or increase of searching recall, etc.
Strategy: The prototype that has been developed has been evaluated through a series of experiments. These evaluations have measured the impact on the relevance ranking of the results returned by the contextualising search engine. A series of experiments was constructed that asked users to evaluate the relevance of a series of results returned by both conventional search engines and our modified search engine. The following table compares the correlation achieved between users' evaluations of page relevance and the rankings produced by conventional search engines (averaged across a pool of five common engines) and by the contextualising search engine.
| Search query | Conventional search engines | Contextualising search engine |
| Terms = Fish; Context = Recipe | p = 0.63 | p = 0.69 |
| Terms = Java Programming; Context = Changes Recent Evolution | p = 0.74 | p = 0.76 |
| Terms = Interest Rate; Context = Mortgage Loan House Purchase | p = 0.72 | p = 0.81 |
| Terms = Beatles; Context = Music Lyrics | p = 0.54 | p = 0.73 |
These (rather early) results illustrate that a significant improvement in the relevance of search results can be achieved through the use of the context provided by associated pages.
NOTE: These results are only preliminary. We are currently carrying out more thorough evaluation. A more comprehensive statistical analysis of the effectiveness of the prototype will be included in the final paper. We shall also make available an online version of the prototype for evaluation.
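The correlation coefficient used in these experiments is not specified in detail here. Assuming, purely for illustration, a simple Pearson correlation between the users' relevance judgements and the engine's ranking scores, it could be computed as in the sketch below (the score values shown are hypothetical).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical example: user relevance judgements vs engine ranking scores.
user_scores   = [5, 3, 4, 1, 2]
engine_scores = [0.9, 0.4, 0.7, 0.2, 0.3]
print(round(pearson(user_scores, engine_scores), 2))
```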
Question: How does this approach relate to alternative approaches to searching, and how can they be integrated or merged to optimise searching performance? A significant amount of research is being carried out on improvements to searching techniques. Having developed and validated an approach based on the use of local Web structure and content to contextualise searches, we need to be able to identify how this approach can be merged with other approaches to optimise performance. For example, can the contextual information that we derive be correlated with user models to adapt the search to individuals?
Strategy: We have yet to consider this area in detail, and this is a significant area for further research. We will look at existing search engine research and develop a series of strategies for integrating our approach into a hybrid search engine that provides optimal overall performance. In particular, we will develop a set of recommendations that provide guidance on how our approach may be used to improve the overall ability to locate information within the space represented by the Web.
The importance of effective mechanisms to search for content on the Web will only increase. The number of users of the Web is growing rapidly. Similarly, the volume of information available on the Web continues to grow at a phenomenal rate, with lower estimates of the number of web pages accessible, as of mid-1999, exceeding 800 million pages (HREF2). As users become more dependent upon the Internet for access to a wide variety of information resources, they will rely on sophisticated search tools to locate these resources. These tools need to be able to take into account the specific context of each user, and locate content which is relevant to these contexts. The first step in this process will be to provide mechanisms for identifying the specific context of given content - something which is the focus of our approach.
The use of metadata to define a context is effective in constrained data sets, and is well supported by technologies such as RDF (Resource Description Framework). Metadata, however, requires too rigorous a maintenance effort to be effective for general use on the Web. Hypermedia linking, on the other hand, is an inherent structuring mechanism that is typically created as a natural part of the development of content for the Web. As we have explained, we can use this structure to support the automatic determination of context - we can view linked documents as providing an implicit form of metadata. As such it makes sense to leverage this structure (which we can do at very minimal additional cost using the approach we are proposing) to improve searching mechanisms. The result will be a substantially improved ability to locate relevant content without damaging recall.
The specific outcomes of this research include a practical model of the way in which interlinked Web documents provide contexts, the use of this model in a contextualising search engine prototype, and an evaluation of the effectiveness of this approach. This constitutes an important advance in our approach to searching for content on the Web. For users this will have immediate practical significance, but will also be important in terms of developing a fundamental improvement in our understanding of how to automatically evaluate the context of Web content.
We have proposed an approach to improving the precision of search engines, whilst maintaining the level of recall achievable, by utilising the local information space (as defined by linked pages) to provide a search context. In effect we are proposing multi-document contextualisation of search results. To demonstrate the validity of the approach we have developed a simple prototype. This prototype - even though based on a very simple model of context - is sufficient to provide a statistically significant enhancement of the relevance of search results. We are currently carrying out further tests to quantify more precisely the level of improvement that is achievable.
To improve this approach further there are a number of directions for further work. The most significant of these is to develop an improved model of the relationship between information context and hypertextually linked pages in a web environment. This model forms the foundations of the approach being adopted and is required for developing a more rigorously supported set of searching algorithms. For example, we need to more clearly define the significance of the syntactic and semantic distance between search terms and context terms, and how to interpret link navigation in terms of semantic distance. We also need a clearer conceptualisation of what constitutes a valid context term. For example, it may be more appropriate (and intuitive) to allow the user to specify a single query, and then utilise the whole query as both search terms and contextual terms. We can also consider aspects such as the use of thesauri to provide a more comprehensive coverage of possible context terms.
It is also worth noting that at present there is a significant performance overhead associated with our approach. For example, a query that returns 200 hits requires between 40 seconds and 5 minutes to refine the rankings (depending upon available bandwidth, network traffic, and server response times). Although results are presented to the user incrementally, this is still an unacceptable delay. It is partially due to our use of commercial search engines, rather than maintaining our own indexes. We do, however, still need to investigate modified algorithms that provide improved performance.
We are also looking at broader evaluations to determine the specific limitations of the approach. For example, we wish to consider if there is potential for lowering the level of recall if the context of a page is misinterpreted.
Overall, the simple prototype we have constructed is sufficient to demonstrate a significant enhancement in the quality of search results that are achievable.
The author would like to thank Ross Todd, Mairead Browne and Andrew Bucknell for the valuable input provided during this research - particularly in terms of the insightful discussions related to information management and searching on the Web.
Filman and Pant, "Searching the Internet", IEEE Internet Computing, Vol 2. No. 4, July-August 1998, pp21-23
McNicholas & Todd, "New kids on the box", Scan, 15(4), 1996, pp 40-42
Lawrence and Giles, "Searching the World Wide Web", Science, Vol 280, No. 5360, 1998, p98
Selberg & Etzioni, "Multi-Service Search and Comparison using the MetaCrawler", Proc. 1995 WWW Conference, 1995
Salton, The SMART Retrieval System, Prentice-Hall, 1971
Harman, "Ranking Algorithms", in Information Retrieval Data Structures and Algorithms, Frakes and Baeza-Yates, eds. Prentice-Hall, 1992, pp 363-392
Benitez, Beigi & Chang, "Using Relevance Feedback in Content-Based Image Metasearch", IEEE Internet Computing, Vol 2 No. 4, Jul-Aug 1998, pp59-69
Lawrence & Giles, "Context and Page Analysis for Improved Web Search", IEEE Internet Computing, Vol 2 No. 4, Jul-Aug 1998, pp38-46
Li, "Toward a Qualitative Search Engine", IEEE Internet Computing, Vol 2 No. 4, Jul-Aug 1998, pp24-29
Lowe & Bucknell, "A Model-Based Support for Information Contextualisation in Hypermedia", In Proceedings of MMM'97, Fourth International Conference on Multimedia Modelling, Singapore, November 1997
Ellis, "Progress and problems in information retrieval", Library Association, 1996
David Lowe, © 2000. The author assigns to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The author also grants a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.