Mingfang Wu, CSIRO Mathematics and Information Science, Melbourne, Australia mingfang.wu@cmis.csiro.au
Michael Fuller, Department of Computer Science, RMIT, Australia msf@mds.rmit.edu.au
Ross Wilkinson, CSIRO Mathematics and Information Science, Melbourne, Australia ross.wilkinson@cmis.csiro.au
Most search engines deliver retrieved documents on the web as a ranked list, which is often too long for users to process mentally. In this paper, we introduce two structured methods to improve the delivery of the retrieved documents. One method is a data-driven approach, which uses clustering to automatically extract topics from the retrieved documents and classify the documents accordingly. The other is a question-driven approach, which classifies the retrieved documents according to a set of categories derived dynamically from a user's query. We believe that multiple organizations of the retrieved documents are necessary to support the variety of information seeking tasks.
As more and more documents become accessible over the Internet, people are increasingly using web search engines as a tool to help them find solutions to their problems. Typically, a user formulates a query to represent the information need and sends it to a search engine. The search engine then uses a matching algorithm to retrieve all documents containing at least one query term, and delivers the result as a list in which the documents are ranked according to their similarity to the query. It is up to the user to read through the list to identify the relevant documents that answer the need. This ranked list structure can be effective and efficient if all relevant documents are ranked highly (say, in the top five to ten). Unfortunately, this is not always the case. The list structure fails to help the user find answers in many situations, such as:
In each of the above circumstances, different organisational approaches may be felicitous. For example, a table-of-contents-like overview or summary of the retrieved documents could help the user understand more about the retrieved documents and guide the user's navigation. Redundant documents could be grouped together and represented by a single representative document; relevant documents could be pooled together and separated from irrelevant documents so that the user need not find them one by one. As a post-search step, various structures can be applied to organise the retrieved documents. The user can thus manipulate the organisation of the retrieved documents and find suitable entry points that lead to the needed information in particular circumstances.
We proposed and tested two structured approaches to delivering retrieved documents on the web. The list of retrieved documents is reorganized according to the proposed structures, with the aim of helping users better understand the set of retrieved documents and of providing various entry points to the needed information. These two structures focus on information needs that involve seeking specific aspects of a topic. We refer to this kind of information need as an aspect topic. The answer to an aspect topic consists of more than one related piece of information. For example, the question "What non-surgical alternatives exist for treating heart disease?" might have diet, exercise, meditation, and drug programs as different aspects of the answer.
Classification, which consists of sorting like things into categories, is one of the most primitive and common human activities. The widespread use of categories suggests that they are an effective way to help people manage their information. The basic idea behind the two structured deliveries is to classify the retrieved documents. Instead of using a set of pre-defined static categories, the two structured deliveries use different approaches to capture categories dynamically. One is a data-driven approach that finds categories automatically by using a clustering method, and the other is a question-driven approach that derives categories interactively from a user's expected answer.
Cluster analysis is a tool for revealing structure and relations in data. Cutting and Karger (1992) note that clustering is a technique capable of topic extraction. If the extracted topics map to potential aspects of a topic, this structure will help users in their aspect-seeking task. Here, the extracted topics act as categories, and the retrieved documents are classified into their closest categories.
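The classification step can be illustrated as nearest-topic assignment. The following is a minimal sketch, assuming that documents and extracted topics are represented as sparse term-weight vectors and that "closest" means highest cosine similarity; the similarity measure is not specified here, and the function names `cosine` and `assign_to_closest` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_to_closest(doc_vectors, topic_vectors):
    """Map each document id to the id of its most similar topic vector.

    Assumption: cosine similarity stands in for whatever closeness
    measure the clustering system actually uses.
    """
    assignment = {}
    for doc_id, vec in doc_vectors.items():
        assignment[doc_id] = max(
            topic_vectors,
            key=lambda topic_id: cosine(vec, topic_vectors[topic_id]),
        )
    return assignment


# Toy data for illustration only; not taken from the experiments.
topics = {
    "diet": {"diet": 1.0, "fat": 0.5},
    "exercise": {"exercise": 1.0, "walking": 0.5},
}
docs = {
    "d1": {"diet": 2.0, "heart": 1.0},
    "d2": {"walking": 1.0, "heart": 1.0},
}
assignment = assign_to_closest(docs, topics)
```

In this toy example each document shares terms with only one topic, so the assignment is unambiguous; real retrieved documents would typically overlap with several topics to varying degrees.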
We apply a non-hierarchical clustering algorithm to a subset of the retrieved documents (the 300 highest-ranked documents); the number of clusters is kept between seven and ten. The cluster structure is presented through cluster descriptions (Fuller and Kaszkiel 1998). A cluster description is formed from the ten highest-weighted terms in the cluster vector, the five most frequent word pairs across all documents in the cluster, and the titles of the three documents in the cluster that are most similar to the query.
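The three ingredients of a cluster description can be sketched as follows. This is an illustrative sketch rather than the implementation used in the experiments: it assumes that the cluster vector is a term-to-weight mapping, that a "word pair" means two adjacent words, and that each document's similarity to the query has already been computed. The function name and the dictionary keys are hypothetical.

```python
from collections import Counter


def cluster_description(centroid, docs, n_terms=10, n_pairs=5, n_titles=3):
    """Build a description from a cluster vector and the cluster's documents.

    centroid: dict mapping term -> weight (the cluster vector).
    docs: list of dicts with 'title', 'tokens' (list of words) and
          'query_sim' (similarity to the query, computed elsewhere).
    """
    # Highest-weighted terms of the cluster vector.
    top_terms = sorted(centroid, key=centroid.get, reverse=True)[:n_terms]

    # Most frequent word pairs; assumption: a pair is two adjacent words.
    pair_counts = Counter()
    for doc in docs:
        tokens = doc["tokens"]
        pair_counts.update(zip(tokens, tokens[1:]))
    top_pairs = [pair for pair, _ in pair_counts.most_common(n_pairs)]

    # Titles of the documents most similar to the query.
    ranked = sorted(docs, key=lambda d: d["query_sim"], reverse=True)
    top_titles = [d["title"] for d in ranked[:n_titles]]

    return {"terms": top_terms, "pairs": top_pairs, "titles": top_titles}


# Toy cluster for illustration only; smaller limits keep the output short.
centroid = {"sugar": 0.9, "cuba": 0.8, "export": 0.3}
docs = [
    {"title": "A", "tokens": ["cuba", "sugar", "export"], "query_sim": 0.2},
    {"title": "B", "tokens": ["cuba", "sugar", "crop"], "query_sim": 0.9},
]
desc = cluster_description(centroid, docs, n_terms=2, n_pairs=1, n_titles=1)
```

The default limits (ten terms, five pairs, three titles) follow the description above; they are parameters here only so that the toy example can use smaller values.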
Figure 1 shows the interface used to present the cluster structure. The interface is divided into two panels. The left-hand panel displays the cluster descriptions. Each cluster description contains a link that replaces the cluster descriptions with the titles of all documents in that cluster. Each title links to the content of the document, which is displayed in the right-hand panel.

Figure 1. The interface for cluster-based structure
While the clustering method tries to extract categories automatically from the retrieved documents alone, the question-driven classification approach obtains the categories interactively from users (Wu and Fuller 2000). For an aspect query, a user may not know the exact aspects of the topic, but they usually know the characteristics of the aspects they are looking for. For example, consider the topic "Which countries import sugar from Cuba?" While users may not know beforehand that Russia, Latvia, or Iran are actual aspects of the answer, they usually know that all aspects should be country names. In this example, by using a set of country names (the potential aspects) to classify the retrieved documents, users may be able to find facts about the topic more easily.
Figure 2. Interface for selecting appropriate categories
Figure 3. The interface for categorisation-based structure
To obtain the set of categories, we extracted keywords from a query, then used WordNet (Fellbaum 1998) to identify a set of hyponyms for each keyword. These hyponym sets form the basis for candidate category sets. We then let users decide which category set to use for classification, according to their focus of attention. A window (shown in Figure 2) presents the extracted keywords and a sample of their associated categories; users can consider the alternative categories before selecting the most appropriate one. The retrieved documents are then classified along the selected categories and presented to the users as shown in Figure 3.
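The pipeline from keywords to classified documents can be sketched as below. This is a minimal illustration under stated assumptions: the small `HYPONYMS` table is a hypothetical stand-in for a WordNet lookup, the keywords are assumed to have been extracted and normalised already, and a document is placed in a category simply when the category name occurs in its text. The matching rule and all names in the code are assumptions, not the system's actual method.

```python
# Hypothetical stand-in for a WordNet hyponym lookup; a real system would
# query WordNet instead of using this hand-made table.
HYPONYMS = {
    "country": ["Russia", "Latvia", "Iran"],
    "sugar": ["cane sugar", "beet sugar"],
}


def candidate_category_sets(keywords, hyponyms=HYPONYMS):
    """Return one candidate category set per keyword that has hyponyms."""
    return {kw: hyponyms[kw] for kw in keywords if kw in hyponyms}


def classify(documents, categories):
    """Group document ids under every category whose name occurs in the text.

    Assumption: plain case-insensitive substring matching; a document may
    fall into several categories or into none.
    """
    grouped = {category: [] for category in categories}
    for doc_id, text in documents.items():
        lowered = text.lower()
        for category in categories:
            if category.lower() in lowered:
                grouped[category].append(doc_id)
    return grouped


# Keywords assumed to come from "Which countries import sugar from Cuba?"
keywords = ["country", "sugar", "cuba"]
candidates = candidate_category_sets(keywords)

# The user picks the category set that matches their focus of attention.
selected = candidates["country"]

# Toy documents for illustration only.
documents = {
    "d1": "Russia agreed to buy Cuban sugar.",
    "d2": "Iran and Russia signed a sugar deal with Cuba.",
    "d3": "Sugar prices fell.",
}
grouped = classify(documents, selected)
```

Leaving the choice of category set to the user, rather than picking one automatically, mirrors the interactive selection step shown in Figure 2.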
The interface in Figure 3 is also divided into two panels. In the left-hand panel, the upper frame shows the document categories. Each category is expandable and collapsible; in Figure 3, the first category is shown collapsed, and the second expanded. The middle frame shows the aspects discovered so far, along with the saved documents relevant to each aspect. A button in the bottom frame enables users to add new categories into which documents may be classified. When any document is selected from the upper-left or middle-left frame, its content is shown in the right-hand panel of the window. Any terms that match the currently expanded category are highlighted in red; terms that match the descriptions of other categories are highlighted in blue. This highlighting is intended to help users locate potential answers more easily within what may be lengthy documents. When the user finds information relevant to an aspect of the topic in a document, they can click the Save Aspects button. This opens a pop-up window in which the user can note the aspects to which the document is relevant. The discovered aspects and their associated documents are then added to the middle-left frame. Whereas the upper-left frame helps the user search for information that contributes to their answer, the information in the middle-left frame helps the user synthesise their answer.
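The two-colour highlighting can be sketched as follows. This is an illustrative sketch only: it assumes the document is rendered as HTML and that matching is done on whole words, ignoring case. The function name `highlight` and the inline-style markup are hypothetical choices, not a description of the actual interface code.

```python
import re


def highlight(text, current_terms, other_terms):
    """Wrap matching terms in coloured spans.

    Terms of the currently expanded category are shown in red; terms of
    the other categories are shown in blue. A term in both lists is red.
    """
    colours = {term.lower(): "red" for term in current_terms}
    for term in other_terms:
        colours.setdefault(term.lower(), "blue")
    if not colours:
        return text

    # Longest terms first, so multi-word terms win over their substrings.
    alternatives = sorted(colours, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in alternatives) + r")\b",
        re.IGNORECASE,
    )

    def wrap(match):
        colour = colours[match.group(0).lower()]
        return f'<span style="color:{colour}">{match.group(0)}</span>'

    return pattern.sub(wrap, text)


marked = highlight("Russia and Iran", ["Russia"], ["Iran"])
```

Here `marked` wraps "Russia" in a red span and "Iran" in a blue span, leaving the rest of the text unchanged.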
To determine the usability of the two structured deliveries, we conducted experiments with users. The experiments compared each structured delivery with a ranked list in terms of effectiveness (how well users complete their tasks) and user satisfaction with each interface. In one experiment, the subjects' task was to read a cluster description and then judge whether the cluster contained documents relevant to the given topic. We found that subjects were able to determine correctly from the cluster descriptions which clusters were likely to contain relevant information and which were not. Considering that each topic had only one or two clusters containing relevant documents, we may say that the cluster structure can give users a broad understanding of the relationships among the retrieved documents and thus help users narrow down their further search.
In another two experiments, subjects were asked to perform an aspect-finding task: find and record as many different aspects of a topic as possible within a 15-minute time limit. This task gives no reward for a repeated aspect. The results showed that subjects saved about the same number of aspects with the cluster-based interface as with the list-based interface, whereas they saved more aspects with the categorisation-based interface than with the list-based interface. This indicates that the categorisation-based interface may be more suitable for the aspect-finding task. This may be because clustering can extract topics from the retrieved documents in a broad sense, but cannot directly extract aspects at a finer level.
In both experiments, subjects strongly preferred either structured delivery to the ranked list. The user satisfaction questionnaire also showed that the organisation of the retrieved documents influences subjects' perception. Although the interface of the question-driven approach and the ranked list offered the same amount of information, albeit differently organized, subjects nevertheless felt that the ranked list interface showed too much information, and felt able to find neither enough information nor sufficiently precise information to answer the topics.
We demonstrated two structured ways to deliver retrieved documents on the web. Usability testing showed that the two structures were good at different information seeking tasks. What we have learned from our experience is that the delivery of the retrieved documents is influenced by the characteristics of the retrieved documents, the users' task structure, and the users' preferences. It is arguably not the case that any single delivery mode can satisfy the wide range of information seeking tasks and user groups. It is thus desirable to provide multiple organisations of the retrieved documents, so that users can manipulate the organisations and find the one most suitable for resolving their information needs.
Cutting, D. R., Karger, D. R. et al. (1992). Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. In Proceedings of the 15th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (pp. 318-329).
Fellbaum, C. (ed.) (1998). WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, Massachusetts.
Fuller, M., Kaszkiel, M. et al. (1998). TREC7 Ad Hoc, speech, and interactive tracks at MDS/CSIRO. In Proceedings of the Seventh Text Retrieval Conference (TREC-7).
Wu, M., Fuller, M., & Wilkinson, R. (2000). Question-driven approach to classification retrieved documents. In Proceeding of Australia User Interface Conference, February 2000, Canberra (pp. 134-140).
Mingfang Wu, Michael Fuller, and Ross Wilkinson, © 2000. The author assigns to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The author also grants a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.