BOTH: Cooperative Automatic Web Navigation and Hierarchical Filtering


Michele Angelaccio, Department of "Informatica, Sistemi e Produzione", University of Rome "Tor Vergata", v. della Ricerca Scientifica 1, Rome, Italy Phone +39-6-72594478 Fax +39-6-2020519 Email: angelaccio@utovrm.it Home Page: Michele Angelaccio [HREF1]

L. Zamburru, D. Genovese


Keywords: WorldWideWeb, Browsing Tools, Information Filtering, Cooperative Agents, Network Computing.

Introduction

The exponential growth of the World-Wide Web (W3 for short) is mainly due to the ease with which pages can be published on the Internet using the hypertext-based HTML language. However, the enormous resulting hypertext graph (Web Space) makes it difficult to search for documents in the absence of a uniform and efficient organization. Keeping a large amount of data well organized is difficult: the notion of "well organized" is highly subjective and personal, and what one user finds clear and easy to browse may be difficult for users with different needs or backgrounds. Therefore, besides the browsing process, the user must also rely on some information discovery tool.

These are of two types:

A Subject Tree is a catalogue of Web resources organized in a browsable hierarchy of titles, each covering a topic. It can also be searched in order to speed up navigation, but user browsing remains the fundamental interaction paradigm. The number of Subject Trees is increasing, and often the proposed links point to other Subject Trees devoted to more specific topics. An index-based tool, instead, offers an up-to-date database of W3 resources that can be queried by means of a form-based interface in order to locate a given document. Even though such tools can be queried from a conventional browser and their answers are returned as hypertext links, their purpose is to avoid the browsing process. Examples of Subject Trees are Yahoo, BUBL, etc., whereas the most widespread index-based tools are Lycos, WebCrawler, Altavista, etc. Furthermore, the Yahoo catalogue makes available a search tool ranging over its local pages, with the possibility of redirecting the query to other search tools. Unfortunately, due to the rapidly increasing size of the WebSpace, these tools are becoming insufficient. Examples of this inadequacy are:

Thus the information discovery process appears to be a two-stage process:

  1. information retrieval (by means of Subject Tree tools and/or index-based search tools);
  2. information filtering of the documents pointed to by the retrieved documents.

Among the solutions proposed for optimizing search tools (see Bush Library, Hamline University, 1995 [HREF5] for a list of sites that discuss search tools) there are:

Fish Search [HREF2] (De Bra et al. 1994) and Letizia (Lieberman 1995) are two examples of tools that can be used to support the information filtering phase. Fish Search recursively follows the links contained in each document, starting from the current page and taking the user query into account in order to collect relevant documents. Originally the spider program was part of the client browser; a later form-based version tries to overcome this inefficiency by implementing the spider as a server functionality. However, the main difficulty is that the user has no access to the structure of the document graph: the output is a list of links, similar to that of an index-based search tool.
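The following sketch conveys the general idea of such a query-driven, depth-limited exploration of the links reachable from a starting page. It is illustrative only, not the original Fish Search code; the starting URL and the keyword list are hypothetical, and the energy-based pruning is a simplified reading of the fish-search metaphor.

```python
# Illustrative sketch of a fish-search style crawl (not the original Fish
# Search code): links are followed recursively from a starting page, and a
# branch is abandoned after a few consecutive irrelevant pages.
import re
from urllib.parse import urljoin
from urllib.request import urlopen

LINK_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)

def fetch(url):
    try:
        return urlopen(url, timeout=10).read().decode("utf-8", "replace")
    except OSError:
        return ""

def fish_search(url, keywords, depth, energy, seen, hits):
    """Follow links from url; 'energy' decreases on irrelevant pages."""
    if depth == 0 or energy <= 0 or url in seen:
        return
    seen.add(url)
    text = fetch(url)
    relevant = any(k.lower() in text.lower() for k in keywords)
    if relevant:
        hits.append(url)
    # A relevant page restores the energy of its children, an irrelevant
    # one decreases it, so that barren paths die out early.
    child_energy = 3 if relevant else energy - 1
    for link in LINK_RE.findall(text):
        fish_search(urljoin(url, link), keywords,
                    depth - 1, child_energy, seen, hits)

# Hypothetical starting page and keywords; the output is a flat list of links.
results = []
fish_search("http://www.example.org/start.html", ["Michelangelo", "Paintings"],
            depth=3, energy=3, seen=set(), hits=results)
```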

Letizia, instead, is based on a browsing assistance agent that learns from the user's actions and autonomously follows the links on the current page in order to give hints to the user during the browsing process.

We feel that the main shortcomings of the above tools are the inadequacy of the output shown and the inability to convey how documents are retrieved. In both cases the path connecting retrieved documents is not displayed, and the user is forced to navigate in order to understand the topology of the document graph. What is needed is the capability of visualizing, in an efficient way, the documents explored by the spider, showing the inter-document connections. This would help the user in the information filtering phase by discarding unfruitful paths as soon as possible.

This requires integrating the search process with the visualization of the retrieved documents.

BOTH (Browsing support Oriented To Hierarchies) is a search tool that offers:

A document-tree based user interface support for information filtering
It interfaces the user with the search engine and builds the information view by showing a dynamic HTML page containing a tree, each of whose nodes is a link to a document retrieved by the search engine.

A cooperative distributed search engine
This implements an automatic browsing system over the relevant documents space. Its architecture can be thought of as a distributed information gathering system (Spider) that searches starting from a specified set of links. "Browsing agents" navigate and cooperate to filter the information, coordinated by a "master agent" that interacts with the user.

Its name derives from its being both a cooperative search engine that performs automatic navigation and an information filtering support based on a graphical visualization of the retrieved document graph. We remark that in this way BOTH has the capability to integrate the browsing and searching paradigms.

The user can activate BOTH in the following cases:

The query submission includes the definition of the navigation parameters that characterize the relevant documents graph. The BOTH spider then starts searching the WebSpace for the documents and stores them in a local cache (a mirror of the document graph).

To guarantee an effective visualization of the documents in the mirror and to provide efficient support for filtering, the user interface is implemented by means of a dedicated HTML page (HotView) that shows a dynamically updated tree of document links. The HotView page helps the user in the information filtering phase by allowing the selection of the relevant pages on the most promising paths of the tree. This is achieved by a form-based mechanism embedded in the hypertextual tree that allows the selection of one or more tree nodes, as sketched below.
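As a rough illustration of this mechanism, the sketch below generates a HotView-like HTML form in which every tree node carries a check-box. The node layout, field names and form action URL are assumptions made for the example, not the actual BOTH markup.

```python
# Minimal sketch (not the actual BOTH markup) of how a HotView-like page
# could be rendered: each retrieved document becomes a tree node with a
# check-box, so pages can be selected directly on the displayed tree.
def render_node(node, prefix):
    title, url, score, children = node            # hypothetical node layout
    out = ['<li><input type="checkbox" name="sel" value="%s">' % url,
           '%s <a href="%s">%s</a> (score %d)' % (prefix, url, title, score)]
    if children:
        out.append("<ul>")
        for i, child in enumerate(children, 1):
            out.extend(render_node(child, "%s.%d" % (prefix, i)))
        out.append("</ul>")
    out.append("</li>")
    return out

def render_hotview(root):
    # The form action and field names are assumptions, not the real CGI names.
    body = ['<form method="POST" action="/cgi-bin/both-merge">', "<ul>"]
    body += render_node(root, "1")
    body += ["</ul>",
             '<input type="submit" value="Retrieve selected pages">',
             "</form>"]
    return "\n".join(body)

# Example: a root page with two documents found below it by the Spider.
tree = ("Arts", "http://example.org/arts.html", 2,
        [("Renaissance", "http://example.org/ren.html", 3, []),
         ("Michelangelo", "http://example.org/mich.html", 5, [])])
print(render_hotview(tree))
```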

After selecting the most interesting pages, the user can submit either a new query on the interest graph that starts from the selected pages, or a request for the concatenation of the selected pages.

Section 2 gives a description of the user interface with an example. Section 3 shows the architecture and Section 4 discusses some applications.


The Information Filtering Model

In order to explain the characteristics of the BOTH interface, we first describe in more detail the process of searching for relevant documents. We assume that this process is described by an Information Filtering model with the following characteristics:

To clarify with an example, we consider the following query (Q): "Michelangelo OR Paintings" that aims to search for documents describing the paintings of Michelangelo Buonarroti.

The relevant documents graph (RDG) is denoted by:

The main difficulties concerning this process are:

wrong paths
In some cases irrelevant pages are retrieved and examined. This is typically a source of inefficiency, due to the time spent downloading files and the increased network traffic. For example, a path could be Arts -> Renaissance -> {Raffaello, Leonardo, ...}, where the leaves do not contain information about Michelangelo.
multiple sources of information (delocalization)
In some cases the information is distributed over different documents. For instance, there could be several documents, each dealing with the ancient paintings hosted in a given museum, in which Michelangelo is mentioned. In the absence of a unique reference, the user is forced to navigate among the different pages.

To overcome these difficulties, BOTH provides a set of features in support of the information filtering phase (filtering functions):

The user takes advantage of the compactness given by the HotView to select the most relevant pages. We show a sketched HotView that would be obtained by submitting to BOTH the query Q introduced above. The tree structure shows the pages stored in the mirror by the Spider and preserves the search order. For example, one of the most relevant pages could be the one corresponding to link 1.3.

A key role among the filtering functions is played by the visualization of document references carried out by the HotView. Its main characteristics are the following (HotView features):

hypertext topology
The hypertext organization of the RDG is shown in a hierarchical way. Each node of the tree displayed in the HotView is represented by the following node-items:
document references
the title of the page and a hot link to the original document in W3;
filtering support items
a score, evaluated on the basis of the number of keyword matches, and an abstract, obtained by concatenating a set of phrases each containing at least one keyword (a sketch of this computation follows this list). The advantage is that these features are combined with the capability to browse through a tree instead of a list of links: for instance, if two documents have the same score, the one with the better sequence of scores along the path from the root could be more promising.
redundancy
The WebSpace is a directed graph that may contain cycles, such as those corresponding to back links. BOTH outputs a tree that results from visiting the RDG with a cycle-avoiding mechanism. This corresponds to a spanning-tree algorithm, except for the nodes with indegree greater than one: these nodes are replicated for each incoming arc, to better show all the possible browsing paths that reach them.
dynamic visualization
the HTML code of the HotView page is updated during the search process. When a new tree level becomes available in the local cache, the corresponding node-items are added to the HotView. While the Spider checks the outgoing links of a page, a blinking icon is displayed near the page title.
page selection made on the tree
By applying the HTML form syntax directly to the code corresponding to the tree, relevant pages can be filtered easily by extracting the corresponding subgraph from the displayed tree. This is accomplished by means of a check-box for each node. When the user selects a check-box, the corresponding page is referenced using its local cache address instead of the original URL.
page merging request
A click on the retrieve button activates the merging of the local files corresponding to the selected pages. The result is sent to the remote W3 client (browser) and provides an easy way to obtain a compact view of two or more pages dealing with the same topic. In this way documents can be merged without navigating the tree and without explicitly saving each page locally.
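The following sketch, referenced from the filtering support items above, shows one plausible way to compute the score and the abstract of a page; it is an assumed reconstruction, not the actual MBW code.

```python
# Sketch (assumed, not the actual MBW code) of the two filtering-support
# items attached to each HotView node: a score counting keyword matches and
# an abstract made of the sentences that contain at least one keyword.
import re

def score_and_abstract(text, keywords, max_sentences=3):
    words = re.findall(r"\w+", text.lower())
    score = sum(words.count(k.lower()) for k in keywords)
    sentences = re.split(r"(?<=[.!?])\s+", text)
    matching = [s for s in sentences
                if any(k.lower() in s.lower() for k in keywords)]
    return score, " ... ".join(matching[:max_sentences])

page = ("The Sistine Chapel ceiling was painted by Michelangelo. "
        "It is one of the most famous paintings of the Renaissance. "
        "The chapel itself was built in the 15th century.")
print(score_and_abstract(page, ["Michelangelo", "paintings"]))
# score 2; the abstract concatenates the first two sentences
```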

Figures 1...3 show a typical search session.


[Search form fields: starting page(s) for Automatic Navigation; Automatic Navigation query; Automatic Navigation parameters: depth of search, width of search, timeout (minutes), node limit in expanding tree]

Fig. 1 - Search form

Figure 1 shows the query form accessible via the World-Wide Web. There are:

Figure 2 is an example of HotView in which some nodes are selected; figure 3 corresponds to the so-called result page, obtained by merging the selected documents. The result page contains a query form in which the user can choose, among the selected pages, the starting points for a new refinement search.

In the next section we give a brief description of the BOTH components and of the way in which the search engine is organized.

The software architecture

Recent advances in index-based searching (Web Ants [HREF3]) and disconnected browsing (WebIQ [HREF4]) have pointed out the importance of a distributed implementation of the information gathering system. This is even more important in BOTH, because the tree updating must be interleaved with the search engine in order to perform the hierarchical visualization as fast as possible.

We have chosen the following master/slave architecture for the search engine:

The coordinator (HVC - HotView Coordinator) and the micro-browsers (MBW) are allocated on a heterogeneous network of PCs and workstations and run on a wide set of UNIX-like platforms. The master-slave communications are implemented by means of a dedicated C communication library using the TCP/IP protocol suite.

Each MBW receives a retrieval request from the HVC and activates the concurrent Network Agents, whose function is to contact the remote HTTP servers and download the requested documents. To keep traffic on the wide-area TCP/IP network low, the files retrieved by the Network Agents are stored in a distributed disk cache defined on the LAN. The MBWs also calculate the score and determine the abstract of each page. The HVC, instead, accepts incoming search requests and controls the slaves' activity. A tree structure containing information about the retrieved pages is created and updated for each active user. Periodically, the HVC uses these structures to update the HTML code of the HotView. BOTH keeps the HotView file in a directory accessible via the World-Wide Web during the search, so the user can check the search progress at any time and use the partial results.
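The sketch below illustrates this master/slave division of labour. Python threads and queues stand in for the dedicated C communication library over TCP/IP, and fetch_page and extract_links are hypothetical stand-ins for the Network Agents and the MBW text filter.

```python
# Sketch of the HVC / MBW division of labour; threads and queues stand in
# for the dedicated C communication library used by the real system.
import queue
import threading

task_q = queue.Queue()      # URLs the coordinator wants examined
result_q = queue.Queue()    # (url, outgoing links) reported back by slaves

def micro_browser(fetch_page, extract_links):
    """Slave loop: download a page and report its outgoing links."""
    while True:
        url = task_q.get()
        if url is None:                       # shutdown signal from the HVC
            break
        page = fetch_page(url)                # Network Agent: remote fetch
        result_q.put((url, extract_links(page)))

def coordinator(start_urls, fetch_page, extract_links, n_slaves=4, limit=50):
    """Master loop: schedule the slaves and collect their results."""
    slaves = [threading.Thread(target=micro_browser,
                               args=(fetch_page, extract_links))
              for _ in range(n_slaves)]
    for s in slaves:
        s.start()
    for url in start_urls:
        task_q.put(url)
    visited, pending = set(start_urls), len(start_urls)
    while pending and len(visited) < limit:
        url, links = result_q.get()
        pending -= 1
        for link in links:
            if link not in visited:
                visited.add(link)
                task_q.put(link)
                pending += 1
        # here the real HVC would also regenerate the HotView HTML page
    for _ in slaves:
        task_q.put(None)                      # tell every slave to stop
    for s in slaves:
        s.join()
    return visited

# Toy run over a stubbed "web" in which each page lists its outgoing links.
toy_web = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
print(coordinator(["a"], fetch_page=lambda u: u,
                  extract_links=lambda p: toy_web.get(p, [])))
```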

BOTH is a multiuser system, although it is not designed for free access by the entire Web community. All the HotView pages (each user has a unique identifier) are stored together, and a page is removed after a fixed period of time. This allows disconnected search sessions: the user submits a query and reconnects later to see the results, so the normal browsing activity is not interrupted.

The construction of the document tree T displayed in the HotView page proceeds as follows:

  1. (level T(0)) the root is a node specified by the user;
  2. (level T(i+1), i < d) for each node n' in T(i), a maximum of w external links is followed by the Spider, except when n' has already been visited (weak cycle avoiding property). All the retrieved documents matching the query are linked to n' in the next level of T.
All nodes in T correspond to documents matching the query. The parameters d and w are specified in the query form. A search time-out and an overall expansion limit can also be set. A sketch of this construction is given below.
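The following sketch illustrates this level-by-level construction under the parameters described above; fetch_links and matches_query are hypothetical stand-ins for the Spider and for the keyword filter, and the dictionary-based node layout is an assumption made for the example.

```python
# Sketch of the level-by-level construction of the document tree T shown in
# the HotView; fetch_links and matches_query are hypothetical stand-ins for
# the Spider and the keyword filter.
import time

def build_tree(root_url, matches_query, fetch_links, d, w,
               node_limit=100, timeout_s=300):
    deadline = time.time() + timeout_s
    root = {"url": root_url, "children": []}
    level, expanded, n_nodes = [root], {root_url}, 1
    for _ in range(d):                                # levels T(1) .. T(d)
        next_level = []
        for node in level:
            for url in fetch_links(node["url"])[:w]:  # at most w links per node
                if time.time() > deadline or n_nodes >= node_limit:
                    return root                       # time-out / expansion limit
                if not matches_query(url):            # keep only matching pages
                    continue
                child = {"url": url, "children": []}
                node["children"].append(child)        # replicated if reached twice
                n_nodes += 1
                if url not in expanded:               # weak cycle avoiding:
                    expanded.add(url)                 # a page is never expanded twice
                    next_level.append(child)
        level = next_level
    return root

# Toy run: the back link "a" -> "root" appears again as a leaf, not expanded,
# as described for nodes with indegree greater than one.
toy_links = {"root": ["a", "b"], "a": ["root", "c"], "b": [], "c": []}
print(build_tree("root", matches_query=lambda u: True,
                 fetch_links=lambda u: toy_links.get(u, []), d=2, w=2))
```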

Fig. 4 - System architecture.


The overall organization of BOTH is illustrated in figure 4, which shows the Spider components and a gateway component. The latter process parses the CGI form data coming from the HTTP server, distinguishing the following three user request types (a parsing sketch follows this list):
  1. a query request, containing the list of the initial URLs, the query, all navigation parameters (maximal depth and maximal outdegree of the retrieved documents graph) and the time-out settings;
  2. a result page request, containing the list of document references corresponding to the HotView selection;
  3. a query refinement request, similar to the query request but specified in the result page (the initial pages are among those selected in the HotView).
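A minimal sketch of how the gateway could classify the decoded CGI data into these three request types is given below; the form field names are hypothetical, since the actual form layout is not reproduced here.

```python
# Sketch (hypothetical field names) of how the gateway could classify the
# decoded CGI form data into the three request types listed above.
from urllib.parse import parse_qs

def classify_request(query_string):
    form = parse_qs(query_string)
    if "selected" in form and "query" not in form:
        # 2. result page request: only a list of HotView selections
        return "result_page", form["selected"]
    request = {"query":   form.get("query", [""])[0],
               "urls":    form.get("start_url", []),
               "depth":   int(form.get("depth", ["3"])[0]),
               "width":   int(form.get("width", ["5"])[0]),
               "timeout": int(form.get("timeout", ["10"])[0])}
    if "selected" in form:
        # 3. refinement request: start pages come from the HotView selection
        request["urls"] = form["selected"]
        return "refinement", request
    return "query", request                    # 1. plain query request

print(classify_request("query=Michelangelo+OR+Paintings"
                       "&start_url=http%3A%2F%2Fexample.org&depth=3&width=5"))
```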
The coordinator receives from the gateway process the message corresponding to the user request and schedules each micro-browser according to a given policy. After starting the URL loading, each micro-browser filters the arriving text by applying the search keywords and returns to the coordinator a list of the next URLs to be loaded. Concurrently, the HVC creates the HotView page and, in cooperation with the micro-browsers, decides which URL must be loaded next.

The inefficiency due to the high network load is mitigated by a local cache that avoids network access for pages already retrieved by the Spider.
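A minimal sketch of such a cache, under the assumption that mirrored pages are stored as files named after a hash of their URL:

```python
# Sketch of the local cache (assumed layout): a page already in the mirror
# is served from disk, so no second network access is made for it.
import hashlib
import os
from urllib.request import urlopen

CACHE_DIR = "/tmp/both-cache"                 # hypothetical mirror directory

def cached_fetch(url):
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, hashlib.md5(url.encode()).hexdigest())
    if os.path.exists(path):                  # already mirrored: serve locally
        with open(path, "rb") as f:
            return f.read()
    data = urlopen(url, timeout=10).read()    # first access: fetch and mirror
    with open(path, "wb") as f:
        f.write(data)
    return data
```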


Applications

The combination of the Spider and the hierarchical visualization of the retrieved pages offered by BOTH increases the user's ability to filter a relevant document graph. Therefore BOTH can be used whenever the user, after an information retrieval phase, has too many answers that must be explored. In particular, among the possible benefits provided by BOTH there are:
  1. a way to search for a page reachable through a Subject Tree when the corresponding server does not provide a search tool;
  2. a compact view of a relevant Web document graph;
  3. a way to avoid unnecessary page downloads during the browsing process;
  4. the capability to merge related information located in different pages.
To illustrate the fourth benefit in particular, figures 5 and 6 show another example of the use of BOTH. The goal is to create, starting from a list of newspaper sites, an HTML page showing how news items on the same subject have been reported by different newspapers. BOTH merges the original text extracted from the different newspaper sites (see the HotView in Fig. 5) into the Result page (Fig. 6); a sketch of this merging step follows. Hence the user obtains a compact view of the different journalistic points of view without browsing through each newspaper site.
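The merging step referred to above can be sketched as follows, under the assumption that the selected pages are available as files in the local mirror; the page layout produced here is illustrative, not the actual Result page format.

```python
# Sketch of the Result-page merging step (assumed layout): the cached copies
# of the selected pages are concatenated into one HTML page, one section per
# source, so that the different reports can be read side by side.
def build_result_page(selected_paths, titles):
    parts = ["<html><head><title>BOTH Result page</title></head><body>"]
    for path, title in zip(selected_paths, titles):
        with open(path, encoding="utf-8", errors="replace") as f:
            body = f.read()
        parts.append("<h2>%s</h2>" % title)         # e.g. one newspaper name
        parts.append("<div>%s</div>\n<hr>" % body)  # the original text, unchanged
    parts.append("</body></html>")
    return "\n".join(parts)
```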


Fig. 5 - A HotView about Italian news.

Fig. 6 - The Result page merging the information distributed in the HotView selection displayed in Fig. 5.

Conclusion

Among the search tools intended to help the user in browsing, BOTH may be considered a first step towards a solution of the Information Discovery problem on the Web that does not mask the hypertextual structure. Its main property is to combine automatic navigation over a well-defined document graph with a hierarchical synthesis of the document graph being explored. This allows a fast selection of the most promising links and an easy way to obtain a compact view of information spread over the retrieved documents. Among the open issues there are:


References

Lieberman, H. (1995) Letizia: An Agent That Assists Web Browsing. Proceedings of AAAI 95 - AI Applications in Knowledge Navigation and Retrieval, MIT, Cambridge MA, USA, 1995, pp. 97-103.

De Bra, P., Houben, G. J., Kornatzky, Y. (1994) Navigational Search in the World-Wide Web. URL: ftp://ftp.win.tue.nl/pub/infosystems/www/Mosaic-2.1-fish-short-paper.tar.gz


Hypertext References

HREF1
http://russell.ce.utovrm.it/home/angelac/index.html - Home Page: Michele Angelaccio
HREF2
http://www.win.tue.nl:82/win/cs/is/debra/wwwf94/article.html - DeBra,et al. HTML paper: "Search for arbitrary information in the WWW:the fish search for Mosaic".
HREF3
http://agent2.lycos.com:8001/web - WebAnts Progress Report
HREF4
http://www.cs.purdue.edu/homes/wittmam - WebIQ Design Report
HREF5
http://www.hamline.edu/library/links/comparisons.html - Web Search Tools comparison from Bush Library, Hamline University, 1995

Copyright

Michele Angelaccio, Lorenzo Zamburru, Domenico Genovese ©, 1996. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction, provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers, and for the document to be published on mirrors on the World Wide Web. Any other usage is prohibited without the express permission of the authors.